# Bodo Getting Started Tutorial

In a nutshell, Bodo provides a just-in-time (JIT) compilation workflow using the `@bodo.jit` decorator.
It replaces decorated Python functions with an optimized and parallelized binary version under the hood.

In this tutorial, we will cover the basics of using Bodo and explain its important concepts. We strongly recommend reading this page before using Bodo.

Let's get started!

## Parallel Pandas with Bodo
First, we demonstrate how Bodo automatically parallelizes and optimizes standard Python programs that use pandas and NumPy, without the need to rewrite your code. Bodo can scale your analytics code to thousands of cores, providing speedups of orders of magnitude, depending on program characteristics.

In [1]:
import pandas as pd
pd.options.display.max_rows = 3

### Generate data
To begin, let's generate a simple dataset and write it to a [Parquet](http://parquet.apache.org/) file:

In [2]:
import pandas as pd
import numpy as np

# 10m data points
df = pd.DataFrame(
    {
        "A": np.repeat(pd.date_range("2013-01-03", periods=1000), 10_000),
        "B": np.arange(10_000_000),
    }
)
# set some values to NA
df.iloc[np.arange(1000) * 3, 0] = pd.NA
# using row_group_size helps with efficient parallel read of data later
df.to_parquet("pd_example.pq", row_group_size=100000)
print(df)

                 A        B
0              NaT        0
...            ...      ...
9999999 2015-09-29  9999999

[10000000 rows x 2 columns]


### Example Code in Pandas
Here is a simple data transformation in Pandas that processes a column of datetime values and derives two columns from it:

In [3]:
import time
import pandas as pd

def data_transform():
    t0 = time.time()
    df = pd.read_parquet("pd_example.pq")
    df["B"] = df.apply(lambda r: "NA" if pd.isna(r.A) else "P1" if r.A.month < 5 else "P2", axis=1)
    df["C"] = df.A.dt.month
    print("Total time: {:.2f}".format(time.time()-t0))
    return df

data_transform()

Total time: 61.33


                  A    B    C
0               NaT
...             ...  ...  ...
9999999  2015-09-29   P2  9.0


Standard Python is quite slow for these data transforms for two reasons:
1. The custom code inside `apply()` prevents Pandas from running an optimized prebuilt C library in its backend, so Python interpreter overheads dominate.
2. Python uses just a single CPU core and does not parallelize computation.

Bodo solves both of these problems as we demonstrate below.

### Using Bodo JIT Decorator
Bodo optimizes and parallelizes data workloads by providing just-in-time (JIT) compilation. To run the code with Bodo, all we have to do is add the `@bodo.jit` decorator to the function (with `distributed=False` for now, so that it runs on a single core).

In [6]:
import pandas as pd
import bodo
import time

@bodo.jit(distributed=False)
def data_transform():
    t0 = time.time()
    df = pd.read_parquet("pd_example.pq")
    df["B"] = df.apply(lambda r: "NA" if pd.isna(r.A) else "P1" if r.A.month < 5 else "P2", axis=1)
    df["C"] = df.A.dt.month
    print("Total time: {:.2f}".format(time.time()-t0))
    return df

data_transform()

Total time: 1.04


                  A    B    C
0               NaT
...             ...  ...  ...
9999999  2015-09-29   P2    9


Even though the code is still running on a single core, it is ~60x faster because Bodo compiles the function into a native binary, eliminating the interpreter overheads in `apply`.

Now let’s run the code on all available cores:

In [7]:
@bodo.jit
def data_transform():
    t0 = time.time()
    df = pd.read_parquet("pd_example.pq")
    df["B"] = df.apply(lambda r: "NA" if pd.isna(r.A) else "P1" if r.A.month < 5 else "P2", axis=1)
    df["C"] = df.A.dt.month
    print("Total time: {:.2f}".format(time.time()-t0))
    return df

data_transform()

Total time: 0.32


                  A    B    C
0               NaT
...             ...  ...  ...
9999999  2015-09-29   P2    9


Although the program appears to be a regular sequential Python program, Bodo compiles and *transforms* the decorated code (the `data_transform` function in this example) under the hood, so that it can run in parallel on many cores. Each core operates on a different chunk of the data and communicates with other cores when necessary. The speedup depends on the data and program characteristics, as well as the number of cores used. Usually, we can continue scaling to many more cores as long as the data is large enough.

### Compilation Time and Caching
Bodo’s JIT workflow compiles the function the first time it is called, but reuses the compiled version for subsequent calls. In the previous example, we added timers inside the function to avoid measuring compilation time. Let’s move the timers outside and call the function twice:

In [8]:
@bodo.jit
def data_transform():
    df = pd.read_parquet("pd_example.pq")
    df["B"] = df.apply(lambda r: "NA" if pd.isna(r.A) else "P1" if r.A.month < 5 else "P2", axis=1)
    df["C"] = df.A.dt.month
    df.to_parquet("bodo_output.pq")


t0 = time.time()
data_transform()
print("Total time first call: {:.2f}".format(time.time()-t0))
t0 = time.time()
data_transform()
print("Total time second call: {:.2f}".format(time.time()-t0))

Total time first call: 2.73
Total time second call: 1.42


The first call is slower due to compilation of the function, but the second call reuses the compiled version and runs faster. See [Caching](https://docs.bodo.ai/performance/caching/?h=caching) for more information.

### Parallel Python Processes
![Groupby shuffle communication pattern](img/groupby.jpg)

Bodo uses the MPI parallelism model, which allows cores to communicate efficiently without the overheads of driver-executor libraries.
When a Bodo JIT function is invoked for the first time, Bodo spawns parallel Python processes in the background using MPI, with each process managing a distinct portion of the data.

In [9]:
def load_data_pandas():
    df = pd.read_parquet("pd_example.pq")
    print("pandas dataframe: \n", df)

@bodo.jit
def load_data_bodo():
    df = pd.read_parquet("pd_example.pq")
    print("Bodo dataframe: \n", df)

load_data_pandas()
load_data_bodo()

pandas dataframe: 
                  A        B
0              NaT        0
...            ...      ...
9999999 2015-09-29  9999999

[10000000 rows x 2 columns]
Bodo dataframe: 
 Bodo dataframe: 
 Bodo dataframe: 
 Bodo dataframe: 
 Bodo dataframe: 
 Bodo dataframe: 
 Bodo dataframe: 
 Bodo dataframe: 
                  A        B
7500000 2015-01-23  7500000
7500001 2015-01-23  7500001
7500002 2015-01-23  7500002
7500003 2015-01-23  7500003
7500004 2015-01-23  7500004
...            ...      ...
8749995 2015-05-27  8749995
8749996 2015-05-27  8749996
8749997 2015-05-27  8749997
8749998 2015-05-27  8749998
8749999 2015-05-27  8749999

[1250000 rows x 2 columns]
                 A        B
3750000 2014-01-13  3750000
3750001 2014-01-13  3750001
3750002 2014-01-13  3750002
3750003 2014-01-13  3750003
3750004 2014-01-13  3750004
...            ...      ...
4999995 2014-05-17  4999995
4999996 2014-05-17  4999996
4999997 2014-05-17  4999997
4999998 2014-05-17  4999998
4999999 2014-05-17  499

The first dataframe printed is a regular Pandas dataframe and contains all 10 million rows.
The other dataframes printed are Bodo-parallelized Pandas dataframes, each containing a portion of the data.
In this case, Bodo parallelizes `read_parquet` automatically and loads different chunks of the data on different cores.
In general, the non-JIT parts of a Python program run sequentially, whereas Bodo JIT functions are parallelized.
For more information on handling distributed data in Python/JIT code, see [Handling distributed data](https://docs.bodo.ai/file_io/?h=data).
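The row ranges in the output above are consistent with an even, contiguous split of the 10 million rows across 8 processes. The sketch below reproduces that arithmetic in plain Python. `chunk_bounds` is a hypothetical helper for illustration only; the partitioning Bodo actually uses may differ, for example when the row count is not evenly divisible or when Parquet row groups come into play.

```python
def chunk_bounds(n_rows, n_procs, rank):
    """Return the [start, stop) row range for `rank` under an even contiguous split."""
    base, extra = divmod(n_rows, n_procs)
    # The first `extra` ranks get one additional row each.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

n_rows, n_procs = 10_000_000, 8
for rank in range(n_procs):
    start, stop = chunk_bounds(n_rows, n_procs, rank)
    print(f"rank {rank}: rows {start}..{stop - 1} ({stop - start} rows)")
```

Under this assumption, one process would hold rows 7500000 to 8749999 and another rows 3750000 to 4999999, which matches the two chunks visible in the printed output.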

### Parallel Computation
Bodo automatically divides computation and manages communication across cores as this example demonstrates:

In [10]:
@bodo.jit
def data_groupby():
    df = pd.read_parquet("pd_example.pq")
    df2 = df.groupby("A", as_index=False).sum()
    print(df2)

This program uses `groupby`, which requires rows with the same key to be aggregated together. Therefore, Bodo automatically shuffles the data under the hood using MPI, and the user doesn’t need to worry about parallelism challenges like communication.
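Conceptually, a shuffle-based groupby sends every row to a process chosen from its key, so that each key ends up on exactly one process, which can then aggregate locally. The following plain-Python sketch simulates that pattern, with lists standing in for processes. It illustrates the general technique and is not Bodo's implementation; `dest_rank`, `shuffle_by_key`, and `local_sum` are hypothetical names.

```python
from collections import defaultdict
import zlib

def dest_rank(key, n_procs):
    # Deterministic hash so that every process maps a key to the same destination.
    return zlib.crc32(str(key).encode()) % n_procs

def shuffle_by_key(chunks, n_procs):
    """Redistribute (key, value) rows so that all rows with the same key share a rank."""
    received = [[] for _ in range(n_procs)]
    for chunk in chunks:  # each chunk holds one simulated process's local rows
        for key, value in chunk:
            received[dest_rank(key, n_procs)].append((key, value))
    return received

def local_sum(rows):
    """Aggregate the rows held by one process after the shuffle."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)

# Three simulated processes, each holding an arbitrary slice of the rows.
chunks = [
    [("2013-01-03", 1), ("2013-01-04", 2)],
    [("2013-01-03", 3), ("2013-01-05", 4)],
    [("2013-01-04", 5), ("2013-01-05", 6)],
]
per_rank = [local_sum(rows) for rows in shuffle_by_key(chunks, n_procs=3)]
print(per_rank)
```

Because each key lands on a single rank, the per-rank results are disjoint and their union is the complete groupby result.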

### Bodo JIT Requirements
Bodo JIT currently supports a specific subset of Pandas APIs, and other APIs cannot be used inside JIT functions. For example:

In [10]:
@bodo.jit
def df_unsupported():
    df = pd.DataFrame({"A": [1, 2, 3]})
    df2 = df.transpose()
    return df2

df_unsupported()

BodoError: DataFrame.transpose() not supported yet.

File "../../../../../../var/folders/nb/s_7bnf052hg0lfqvbrw10bsw0000gn/T/ipykernel_27138/2559399185.py", line 4:
<source missing, REPL/exec in use?>

As the error indicates, Bodo doesn’t currently support the `transpose` call in JIT functions. In these cases, either use an alternative API or run this portion of the code in regular Python, outside of JIT functions. See [Pandas Operations](https://docs.bodo.ai/api_docs/pandas/general/#pdcrosstab) for the complete list of supported Pandas operations.

### Type Stability
The key requirement of JIT compilation is being able to infer data types for all variables and values. In Bodo, column names are part of dataframe data types, so Bodo tries to infer column-name-related inputs in all operations. For example, key names in `groupby` are used to determine the output data type and need to be known to Bodo:

In [12]:
@bodo.jit
def groupby_keys(extra_keys):
    df = pd.read_parquet("pd_example.pq")
    keys = [c for c in df.columns if c not in ["B", "C"]]
    if extra_keys:
        keys.append("B")
    df2 = df.groupby(keys).sum()
    print(df2)
    
groupby_keys(False)

BodoError: groupby(): argument 'by' requires a constant value but variable 'keys' is updated inplace using 'append'

File "../../../../../../var/folders/nb/s_7bnf052hg0lfqvbrw10bsw0000gn/T/ipykernel_27138/4278964147.py", line 7:
<source missing, REPL/exec in use?>

In this case, the list of groupby keys is determined by a dynamic flag, and Bodo is not able to infer it from the program at compilation time. The alternative is to pass the keys as an argument to the JIT function to make the values known to Bodo:

In [11]:
@bodo.jit
def groupby_keys(keys):
    df = pd.read_parquet("pd_example.pq")
    df2 = df.groupby(keys).sum()
    print(df2)
    
groupby_keys(["A"])

                      B
A                      
2013-01-03     48496500
2013-01-12    949995000
2013-01-19   1649995000
2013-01-30   2749995000
2013-02-24   5249995000
...                 ...
2015-08-27  96649995000
2015-09-04  97449995000
2015-09-20  99049995000
2015-09-23  99349995000
2015-09-28  99849995000

[108 rows x 1 columns]
                      B
A                      
2013-01-04    149995000
2013-01-07    449995000
2013-03-09   6549995000
2013-03-14   7049995000
2013-03-17   7349995000
...                 ...
2015-09-07  97749995000
2015-09-11  98149995000
2015-09-14  98449995000
2015-09-16  98649995000
2015-09-29  99949995000

[125 rows x 1 columns]
                      B
A                      
2013-01-11    849995000
2013-01-15   1249995000
2013-01-21   1849995000
2013-01-22   1949995000
2013-01-24   2149995000
...                 ...
2015-08-24  96349995000
2015-08-28  96749995000
2015-09-10  98049995000
2015-09-18  98849995000
2015-09-24  99449995000

[119 rows x 1 c

This program works since `keys` is passed from regular Python to the JIT function.
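One way to keep the flag-dependent logic while meeting this requirement is to build the key list in regular Python and pass the finished list into the JIT function. The helper below is a hypothetical sketch (`build_groupby_keys` is not a Bodo API), and it assumes the column names are known up front rather than read from the file.

```python
def build_groupby_keys(columns, extra_keys):
    """Build the groupby key list outside JIT code, where dynamic logic is unrestricted."""
    keys = [c for c in columns if c not in ["B", "C"]]
    if extra_keys:
        keys.append("B")
    return keys

# Assumed column names of pd_example.pq, as generated earlier in this tutorial.
columns = ["A", "B"]

keys = build_groupby_keys(columns, extra_keys=False)
print(keys)  # ['A']

# The result can then be passed to the JIT function defined above:
# groupby_keys(keys)
```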

For more information on our type stability requirements, see our [Documentation on compile time constants](https://docs.bodo.ai/bodo_parallelism/compile_time_constants/?h=compile+time+constan).

### Python Features

Bodo uses [Numba](https://numba.pydata.org/) for compiling regular Python features, and some of Numba’s requirements apply to Bodo as well.
For example, values in data structures like lists should have the same data type. This example fails since list values are either integers or strings:

In [14]:
@bodo.jit
def create_list():
    out = []
    out.append(0)
    out.append("A")
    out.append(1)
    out.append("B")
    return out

create_list()

TypingError: Failed in bodo mode pipeline (step: <class 'bodo.transforms.typing_pass.BodoTypeInference'>)
Invalid use of BoundFunction(list.append for list(int64)<iv=None>) with parameters (Literal[str](A))
During: resolving callee type: BoundFunction(list.append for list(int64)<iv=None>)
During: typing of call at /var/folders/nb/s_7bnf052hg0lfqvbrw10bsw0000gn/T/ipykernel_27138/3649282494.py (5)

File "../../../../../../var/folders/nb/s_7bnf052hg0lfqvbrw10bsw0000gn/T/ipykernel_27138/3649282494.py", line 5:
<source missing, REPL/exec in use?>

Using tuples can often solve these problems since tuples can hold values of different types:

In [15]:
@bodo.jit
def create_list():
    out = []
    out.append((0, "A"))
    out.append((1, "B"))
    return out
create_list()

[(0, 'A'), (1, 'B')]
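Another option, when the values do not need to stay paired in one container, is to keep one homogeneous list per type. The sketch below is shown as plain Python without the decorator so that it runs anywhere. It rests on the assumption that each list, holding a single type, satisfies the same-type requirement described above; `create_lists` is a hypothetical name.

```python
def create_lists():
    # One list per value type: integers in one, strings in the other.
    ids = []
    labels = []
    for i, label in [(0, "A"), (1, "B")]:
        ids.append(i)
        labels.append(label)
    return ids, labels

ids, labels = create_lists()
print(ids, labels)  # [0, 1] ['A', 'B']
```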

See [Unsupported Python Programs](https://docs.bodo.ai/2022.6/bodo_parallelism/not_supported/?h=bodo+pr) for more details.