# Bodo Extended Tutorial

This is a continuation of the getting started tutorial. You are encouraged to visit that tutorial first if you have not done so already. In this tutorial, we will explain core Bodo concepts in more detail, introduce additional Bodo features, and discuss more advanced topics.

### Parallel Execution Model

As we saw in the getting started tutorial, Bodo transforms functions for parallel execution. 
Parallel processes are spawned on the fly the first time a JIT function is invoked,
with each process managing a distinct portion of the data (this is called the Single Program Multiple Data ([SPMD](https://en.wikipedia.org/wiki/SPMD)) paradigm).


In [4]:
import numpy as np
import pandas as pd
# limit the number of rows printed to standard out to reduce clutter
pd.options.display.max_rows = 3
import bodo

@bodo.jit
def f(n, a):
    df = pd.DataFrame({"A": np.arange(n) + a})
    print(df)

f(16, 1)

   A
4  5
5  6
   A
6  7
7  8
     A
10  11
11  12
   A
2  3
3  4
    A
8   9
9  10
   A
0  1
1  2
     A
14  15
15  16
     A
12  13
13  14


### Parallel APIs

Bodo provides a limited number of parallel APIs to support advanced cases that may need them. The example below demonstrates getting the process number from Bodo (called `rank` in MPI terminology) and the total number of processes.

In [5]:
@bodo.jit
def f():
    # some work only on rank 0
    if bodo.get_rank() == 0:
        print("rank 0 done")
    
    # some work on every process
    print("rank", bodo.get_rank(), "here")
    print("total ranks:", bodo.get_size())
f()

rank 4 here
rank 6 here
rank 3 here
rank 2 here
rank 0 done
rank 0 here
total ranks: 8
rank 1 here
rank 5 here
rank 7 here


A common pattern is using barriers to make sure all processes see side-effects at the same time. For example, a process can delete files from storage while others wait before writing to a file:

In [9]:
import shutil, os
import numpy as np

@bodo.wrap_python(bodo.types.none)
def delete_files():
    if os.path.exists("data/data.pq"):
        shutil.rmtree("data/data.pq")

@bodo.jit
def f(n):
    # remove file if exists
    if bodo.get_rank() == 0:
        delete_files()
    
    # make sure all processes are synchronized
    # (e.g. all processes need to see effect of rank 0's work)
    bodo.barrier()
    df = pd.DataFrame({"A": np.arange(n)})
    df.to_parquet("data/data.pq")

f(10)

The following figure illustrates what happens when processes call `bodo.barrier()`. When barrier is called, a process pauses and waits until all other processes have reached the barrier:

![Process synchronization with Barrier](img/barrier.svg)

<div class="alert alert-block alert-danger">
<b>Important:</b> The examples above show that it is possible to have each process follow a different control flow, but all processes must always call the same Bodo functions in the same order.
</div>
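
The barrier behavior described above can be sketched in plain Python. This is only an analogy: it uses threads and the standard library's `threading.Barrier` as a stand-in for Bodo processes and `bodo.barrier()`, and the worker count and log messages are made up for illustration.

```python
import threading

NUM_WORKERS = 4  # stand-in for the number of Bodo processes
events = []      # shared log of what happened, in order
events_lock = threading.Lock()
barrier = threading.Barrier(NUM_WORKERS)

def worker(rank):
    if rank == 0:
        # only "rank 0" performs the cleanup step
        with events_lock:
            events.append("rank 0 cleaned up")
    # nobody proceeds until every worker has arrived here
    barrier.wait()
    with events_lock:
        events.append(f"rank {rank} wrote data")

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# the cleanup entry is always first; the order of the writes may vary
print(events)
```

Because every worker waits at the barrier, no write can be logged before the cleanup, regardless of how the threads are scheduled.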

## Data Distribution

Bodo parallelizes computation by dividing data into separate chunks across processes. However, some data handled by a Bodo function may not be divided into chunks. There are two main data distribution schemes:

- Replicated (*REP*): the data associated with the variable is the same on every process.
- One-dimensional (*1D*): the data is divided into chunks, split along one dimension (rows of a dataframe or first dimension of an array).
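
To make the one-dimensional scheme concrete, here is a minimal sketch of a common block-partitioning formula. It is an assumption rather than Bodo's actual implementation, and `block_range` is a hypothetical helper, but it reproduces the chunk boundaries visible in this tutorial's outputs (for example, a chunk starting at row 1464 when 3902 rows are split over 8 processes).

```python
def block_range(n, rank, size):
    """Return (start, stop) of the rows owned by `rank` when `n` rows
    are split into `size` contiguous blocks of near-equal length."""
    base, extra = divmod(n, size)
    # the first `extra` ranks each take one additional row
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

# 16 rows over 8 processes: two rows each
print([block_range(16, r, 8) for r in range(8)])

# 3902 rows over 8 processes: chunk lengths per rank
ranges = [block_range(3902, r, 8) for r in range(8)]
print([stop - start for start, stop in ranges])
```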

Bodo determines distribution of variables automatically, using the nature of the computation that produces them. Let's see an example:

In [13]:
@bodo.jit
def mean_power_speed():
    df = pd.read_parquet("data/cycling_dataset.pq")
    m = df[["power", "speed"]].mean()
    bodo.parallel_print(m)

mean_power_speed()

power    102.078421
speed      5.656851
dtype: float64
power    102.078421
speed      5.656851
dtype: float64
power    102.078421
speed      5.656851
dtype: float64
power    102.078421
speed      5.656851
dtype: float64
power    102.078421
speed      5.656851
dtype: float64
power    102.078421
speed      5.656851
dtype: float64
power    102.078421
speed      5.656851
dtype: float64
power    102.078421
speed      5.656851
dtype: float64


In this example, `df` is parallelized (each process reads a different chunk) but `m` is replicated, even though it is a Series. Semantically, it makes sense for the output of the `mean` operation to be replicated on all processes, since it is a reduction and produces "small" data.
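
The following plain-Python sketch illustrates why a reduction such as `mean` naturally yields replicated data: each process reduces its own chunk to a small partial result, and once the partials are exchanged every process can compute the same final value. The chunking, values, and helper names are hypothetical; this does not show how Bodo implements the exchange.

```python
def local_partial(chunk):
    """Each process reduces only its own chunk to (sum, count)."""
    return sum(chunk), len(chunk)

def combine(partials):
    """Combine the per-process partials into the global mean."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

values = [3.0, 5.0, 8.0, 1.0, 4.0, 9.0, 2.0]
chunks = [values[0:3], values[3:5], values[5:7]]  # three hypothetical processes

partials = [local_partial(c) for c in chunks]
# after the exchange, every process holds the same small result
replicated = [combine(partials) for _ in chunks]
print(replicated)
```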

### Distributed Diagnostics

The distributions found by Bodo can be printed either by setting the `distributed_diagnostics` JIT flag or the
environment variable `BODO_DISTRIBUTED_DIAGNOSTICS=1`.
Let's examine the previous example's distributions:

In [3]:
@bodo.jit(distributed_diagnostics=True, spawn=False)
def mean_power_speed():
    df = pd.read_parquet("data/cycling_dataset.pq")
    m = df[["power", "speed"]].mean()
    bodo.parallel_print(m)

mean_power_speed()

Distributed diagnostics for function mean_power_speed, /var/folders/nb/s_7bnf052hg0lfqvbrw10bsw0000gn/T/ipykernel_62584/1953254433.py (1)

Data distributions:
   pq_table.823              1D_Block
   pq_index.824              1D_Block
   _v236call_19_985          1D_Block
   _v178call_15_862          1D_Block
   _v338call_29_874          1D_Block
   data_878                  REP
   _v76call_6_1160           REP
   _v156call_13_908          REP
   _v478call_44_890          REP
   table.1323                1D_Block

Parfor distributions:
   2                    1D_Block
   3                    1D_Block

Distributed listing for function mean_power_speed, /var/folders/nb/s_7bnf052hg0lfqvbrw10bsw0000gn/T/ipykernel_62584/1953254433.py (1)
--------------------------------------------------------| parfor_id/variable: distribution
@bodo.jit(distributed_diagnostics=True, spawn=False)    | 
def mean_power_speed():                                 | 
    df = pd.read_parquet("data/cycling_dataset.p

Variables are renamed due to optimization. The output shows that the `power` and `speed` columns of `df` are distributed (`1D_Block`) but `m` is replicated (`REP`). This is because `df` is the output of `read_parquet` and the input of `mean`, both of which can be distributed by Bodo. `m` is the output of `mean`, which is always replicated (available on every process).

### Function Arguments and Return Values

Now let's see what happens if we pass the data into the Bodo function as a function parameter:

In [1]:
@bodo.jit
def mean_power_speed(df):
    m = df[["power", "speed"]].mean()
    return m

df = pd.read_parquet("data/cycling_dataset.pq")
df.time = df.time.astype("datetime64[ns]")
res = mean_power_speed(df)
print(res)

power    102.078421
speed      5.656851
dtype: float64


The program runs and returns the same correct value as before. However, the input DataFrame is sent from the main Python process
to the parallel MPI processes, which can introduce significant overhead. In addition,
parallel I/O and other I/O-related optimizations such as filter pushdown and column pruning cannot be performed.
Therefore, reading data inside JIT functions is very important for best performance.

Bodo will attempt to return distributed data in most cases. However, the data is gathered in the
main process lazily only when necessary. `BodoDataFrame` is a DataFrame wrapper class that manages this lazy data gather.

In [3]:
import bodo
import pandas as pd

pd.options.display.max_columns = 7

@bodo.jit
def mean_power_speed():
    df = pd.read_parquet("data/cycling_dataset.pq")
    return df

df = mean_power_speed()
print(type(df))
print(df)

<class 'bodo.pandas.frame.BodoDataFrame'>
      Unnamed: 0    altitude  cadence  ...  power  speed                time
0              0  185.800003       51  ...     45  3.459 2016-10-20 22:01:26
...          ...         ...      ...  ...    ...    ...                 ...
3901        1131  178.399994        0  ...      0  2.853 2016-10-20 23:14:35

[3902 rows x 10 columns]


### Passing Distributed Data to Bodo

Parallel dataframes returned by Bodo can be passed between Bodo functions without gathering the data in the main process.

In [4]:
@bodo.jit
def read_data():
    df = pd.read_parquet("data/cycling_dataset.pq")
    print("total size", len(df))
    return df

@bodo.jit
def write_data(df):
    print("total size", len(df))
    df.to_parquet("data/cycling_dataset2.pq")

df = read_data()
write_data(df)

total size 3902
total size 3902


## Parallel I/O

![Bodo reads file chunks in parallel](img/file-read.jpg)

Efficient parallel data processing requires data I/O to be parallelized effectively as well.
Bodo provides parallel file I/O for many different formats such as [Parquet](http://parquet.apache.org),
[Iceberg](https://iceberg.apache.org/), Snowflake,
CSV, JSON, Numpy binaries, [HDF5](http://www.h5py.org) and SQL databases.
This diagram demonstrates how chunks of data are partitioned among parallel execution engines by Bodo.

### Parquet

Parquet is a commonly used file format in analytics due to its efficient columnar storage. Bodo supports the standard pandas API for reading Parquet:

In [5]:
@bodo.jit
def pq_read():
    df = pd.read_parquet("data/cycling_dataset.pq")
    print(df)

pq_read()

      Unnamed: 0    altitude  cadence  ...  power  speed                time
1464          69  125.599998       82  ...      9  6.890 2016-10-20 22:32:07
1465          70  125.800003      144  ...      9  6.890 2016-10-20 22:32:08
1466          71  126.000000        0  ...      4  6.590 2016-10-20 22:32:09
1467          72  125.800003        0  ...      2  6.334 2016-10-20 22:32:10
1468          73  125.800003        0  ...     11  6.306 2016-10-20 22:32:11
...          ...         ...      ...  ...    ...    ...                 ...
1947         552  147.000000       72  ...    135  6.356 2016-10-20 22:41:29
1948         553  147.399994       99  ...    153  6.452 2016-10-20 22:41:30
1949         554  147.600006       85  ...    117  6.550 2016-10-20 22:41:31
1950         555  147.600006       78  ...    129  6.581 2016-10-20 22:41:32
1951         556  147.600006       78  ...    121  6.621 2016-10-20 22:41:33

[488 rows x 10 columns]
      Unnamed: 0    altitude  cadence  ...  power  

Bodo also supports the pandas API for writing Parquet files:

In [6]:
@bodo.jit
def generate_data_and_write():
    df = pd.DataFrame({"A": np.arange(80)})
    df.to_parquet("data/pq_output.pq")

generate_data_and_write()

<div class="alert alert-block alert-info">
<b>Note:</b> Bodo writes a directory of parquet files (one file per process) when writing distributed data. Bodo writes a single file when the data is replicated.
</div>

In this example, `df` is distributed data, so it is written to a directory of Parquet files.

Bodo supports parallel reads of single Parquet files, as well as directories of files:

In [7]:
@bodo.jit
def read_parquet_dir():
    df = pd.read_parquet("data/pq_output.pq")
    print(df)

read_parquet_dir()

     A
40  40
41  41
42  42
43  43
44  44
45  45
46  46
47  47
48  48
49  49
     A
50  50
51  51
52  52
53  53
54  54
55  55
56  56
57  57
58  58
59  59
     A
70  70
71  71
72  72
73  73
74  74
75  75
76  76
77  77
78  78
79  79
     A
10  10
11  11
12  12
13  13
14  14
15  15
16  16
17  17
18  18
19  19
     A
20  20
21  21
22  22
23  23
24  24
25  25
26  26
27  27
28  28
29  29
     A
30  30
31  31
32  32
33  33
34  34
35  35
36  36
37  37
38  38
39  39
     A
60  60
61  61
62  62
63  63
64  64
65  65
66  66
67  67
68  68
69  69
   A
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9


### CSV
CSV is a common text format for data exchange. Bodo supports the standard pandas API to read CSV files:

In [8]:
@bodo.jit()
def csv_example():
    df = pd.read_csv("data/cycling_dataset.csv", header=None)
    print(df)

csv_example()

         0    1           2    3  ...          7    8      9                   10
1464  1464   69  125.599998   82  ... -97.764937    9   6.89  2016-10-20 22:32:07
1465  1465   70  125.800003  144  ... -97.764988    9   6.89  2016-10-20 22:32:08
1466  1466   71       126.0    0  ... -97.765038    4   6.59  2016-10-20 22:32:09
1467  1467   72  125.800003    0  ... -97.765083    2  6.334  2016-10-20 22:32:10
1468  1468   73  125.800003    0  ... -97.765132   11  6.306  2016-10-20 22:32:11
...    ...  ...         ...  ...  ...        ...  ...    ...                  ...
1947  1947  552       147.0   72  ... -97.782195  135  6.356  2016-10-20 22:41:29
1948  1948  553  147.399994   99  ... -97.782233  153  6.452  2016-10-20 22:41:30
1949  1949  554  147.600006   85  ...  -97.78227  117   6.55  2016-10-20 22:41:31
1950  1950  555  147.600006   78  ... -97.782309  129  6.581  2016-10-20 22:41:32
1951  1951  556  147.600006   78  ... -97.782343  121  6.621  2016-10-20 22:41:33

[488 rows x 11 

In addition to the pandas `read_csv()` functionality, Bodo can also read a directory containing multiple CSV files (all part of the same dataframe).
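
For comparison, reading such a directory without Bodo takes an explicit loop over the part files. The sketch below uses only the standard library; the file names and the `read_csv_dir` helper are hypothetical, and it says nothing about how Bodo performs the read internally.

```python
import csv
import tempfile
from pathlib import Path

def read_csv_dir(directory):
    """Read every *.csv file in `directory` (sorted by name) and
    return all rows as one list, as if they were a single table."""
    rows = []
    for path in sorted(Path(directory).glob("*.csv")):
        with open(path, newline="") as f:
            rows.extend(csv.reader(f))
    return rows

with tempfile.TemporaryDirectory() as tmp:
    # two hypothetical part files that together form one table
    Path(tmp, "part-0.csv").write_text("0,a\n1,b\n")
    Path(tmp, "part-1.csv").write_text("2,c\n3,d\n")
    table = read_csv_dir(tmp)

print(table)
```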

<div class="alert alert-block alert-info">
<b>Note:</b>

When writing distributed data to CSV:

- To S3 or HDFS: Bodo writes to a directory of CSV files (one file per process)
- To POSIX filesystem (e.g. local filesystem on Linux): Bodo writes the distributed data in parallel to a single file.

If the data is replicated, Bodo always writes to a single file.

</div>

### HDF5
HDF5 is a common format in scientific computing and AI, especially for multi-dimensional numerical data. HDF5 can be very efficient at scale, since it has native parallel I/O support. Bodo supports the standard h5py APIs:

In [9]:
import h5py

@bodo.jit
def example_h5():
    f = h5py.File("data/data.h5", "r")
    return f["A"][:].sum()

res = example_h5()
print(res)

66


### Numpy Binary Files
Bodo supports reading and writing binary files using Numpy APIs as well.

In [10]:
@bodo.jit
def example_np_io():
    A = np.fromfile("data/data.dat", np.int64)
    return A.sum()

res = example_np_io()
print(res)

45


### Type Annotation (when file name is unknown at compile time)

Bodo needs to know or infer the types of all data, but this is not always possible for input from files if the file name is not known at compilation time.

For example, suppose we have the following files:

In [11]:
import pandas as pd
import numpy as np

def generate_files(n):
    for i in range(n):
        df = pd.DataFrame({"A": np.arange(5, dtype=np.int64)})
        df.to_parquet("data/test" + str(i) + ".pq")

generate_files(5)

And we want to read them like this:

In [12]:
import pandas as pd
import numpy as np
import bodo

@bodo.jit
def read_data(n):
    x = 0
    for i in range(n):
        file_name = "data/test" + str(i) + ".pq"
        df = pd.read_parquet(file_name)
        print(df)
        x += df["A"].sum()
    return x

result = read_data(5)
# BodoError: Parquet schema not available. Either path argument should be
# constant for Bodo to look at the file at compile time or schema should be provided.

BodoError: Parquet schema not available. Either path argument should be constant for Bodo to look at the file at compile time or schema should be provided. For more information, see: https://docs.bodo.ai/latest/file_io/#parquet-section.

File "../../../../../../var/folders/nb/s_7bnf052hg0lfqvbrw10bsw0000gn/T/ipykernel_67841/4178619562.py", line 9:
<source missing, REPL/exec in use?>

The file names are computed at runtime, which doesn't allow the compiler to find the files and extract the schemas. As shown below, the solution is to use *type annotation* to provide data types to the compiler.

#### Type annotation for Parquet files

The example below uses the `locals` option of the decorator to provide the compiler with the schema of the local variable `df`:

In [14]:
@bodo.jit(locals={"df": {"A": bodo.types.int64[:]}})
def read_data(n):
    x = 0
    for i in range(n):
        file_name = "data/test" + str(i) + ".pq"
        df = pd.read_parquet(file_name)
        x += df["A"].sum()
    return x

result = read_data(5)
print(result)

50




#### Type annotation for CSV files

For CSV files, we can annotate types in the same way as pandas:

In [15]:
def generate_files(n):
    for i in range(n):
        df = pd.DataFrame({"A": np.arange(5, dtype=np.int64)})
        df.to_csv("data/test" + str(i) + ".csv", index=False)

@bodo.jit
def read_data(n):
    coltypes = {"A": np.int64}
    x = 0
    for i in range(n):
        file_name = "data/test" + str(i) + ".csv"
        df = pd.read_csv(file_name, names=coltypes.keys(), dtype=coltypes, header=0)
        x += df["A"].sum()
    return x

n = 5
generate_files(n)
result = read_data(n)
print(result)

50


## Bodo Caching

In many situations, Bodo can save the binary resulting from the compilation of a function to disk, to be reused in future runs. This avoids the need to recompile functions the next time that you run your application.

As we explained earlier, recompiling a function is only necessary when it is called with new input types, and the same applies to caching. In other words, an application can be run multiple times and process different data without having to recompile any code if the data types remain the same (which is the most common situation).
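
The relationship between input types and recompilation can be illustrated with a toy decorator. This is a sketch of the general idea only, not how Bodo or Numba implement compilation or caching; `compile_per_signature` and its `signatures` attribute are hypothetical names.

```python
import functools

def compile_per_signature(func):
    """Toy stand-in for a JIT decorator: 'compiles' once per tuple of
    argument types and reuses the result afterwards."""
    compiled = {}

    @functools.wraps(func)
    def wrapper(*args):
        signature = tuple(type(a) for a in args)
        if signature not in compiled:
            # a real compiler would generate machine code here
            compiled[signature] = func
        return compiled[signature](*args)

    wrapper.signatures = compiled
    return wrapper

@compile_per_signature
def add(a, b):
    return a + b

add(1, 2)       # first call with (int, int): compiles
add(10, 20)     # same types, different data: reuses
add(1.5, 2.5)   # new types (float, float): compiles again
print(len(add.signatures))  # 2
```

Different data with the same types never triggers a new "compilation", which is why a cached binary can be reused across runs.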

<div class="alert alert-block alert-warning">
<b>Warning:</b> Caching works in many (but not all) situations, and is disabled by default. See caching limitations below for more information.
</div>

### Caching Example

To cache a function, we only need to add the option `cache=True` to the JIT decorator:

In [16]:
import time

@bodo.jit(cache=True)
def mean_power_speed():
    df = pd.read_parquet("data/cycling_dataset.pq")
    return df[["power", "speed"]].mean()

t0 = time.time()
result = mean_power_speed()
print(result)
print("Total execution time:", round(time.time() - t0, 3), "secs")

power    102.078421
speed      5.656851
dtype: float64
Total execution time: 1.35 secs


The first time that the above code runs, Bodo compiles the function and caches it to disk. In subsequent runs, it will recover the function from cache and the execution time will be much faster as a result. You can try this out by running the above code multiple times, and changing between `cache=True` and `cache=False`.

### Cache Location and Portability

In most cases, the cache is saved in the `__pycache__` directory inside the directory where the source files are located.

On Jupyter notebooks, the cache directory is called ``numba_cache`` and is located in ``IPython.paths.get_ipython_cache_dir()``. See [here](http://numba.pydata.org/numba-doc/latest/reference/envvars.html?#envvar-NUMBA_CACHE_DIR) for more information on these and other alternate cache locations. For example, when running in a notebook:

In [17]:
import os
import IPython

cache_dir = IPython.paths.get_ipython_cache_dir() + "/numba_cache"
print("Cache files:")
os.listdir(cache_dir)

Cache files:


['575913040.mean_power_speed-b2cd27ff54.py312bodo2024.11.2.dev18+ge776d57a2c.d20241201-af6917c9bc5148f3345e1483e0d8d05e.nbi',
 '575913040.mean_power_speed-b2cd27ff54.py312bodo2024.11.2.dev18+ge776d57a2c.d20241201-af6917c9bc5148f3345e1483e0d8d05e.1.nbc']

Cached objects work across systems with the same CPU model and CPU features. Therefore, it is safe to share and reuse the contents in the cache directory on a different machine. See [here](http://numba.pydata.org/numba-doc/latest/developer/caching.html#cache-sharing) for more information.

### Cache Invalidation

The cache is invalidated automatically when the corresponding source code is modified. One way to observe this behavior is to modify the above example after it has been cached a first time, by changing the name of the variable `df`. The next time that we run the code, Bodo will determine that the source code has been modified, invalidate the cache and recompile the function.

<div class="alert alert-block alert-warning">
<b>Warning:</b> It is sometimes necessary to clear the cache manually (see caching limitations below). To clear the cache, the cache files can simply be removed.
</div>
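
Clearing the cache manually could look like the sketch below. It assumes the cache consists of the `.nbi` and `.nbc` files shown in the listing above; `clear_cache_files` is a hypothetical helper, and the demonstration runs on a throwaway directory with made-up file names rather than a real cache.

```python
import tempfile
from pathlib import Path

def clear_cache_files(cache_dir):
    """Delete cache index (.nbi) and data (.nbc) files in `cache_dir`
    and return the names of the files that were removed."""
    removed = []
    # materialize the listing first so deletion does not affect iteration
    for path in list(Path(cache_dir).iterdir()):
        if path.is_file() and path.suffix in {".nbi", ".nbc"}:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)

# demonstration on a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("f-abc.nbi", "f-abc.1.nbc", "notes.txt"):
        Path(tmp, name).touch()
    removed = clear_cache_files(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(removed, remaining)
```

Files with other extensions are left untouched, so the helper is safe to point at a directory that also holds unrelated files.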

### Current Caching Limitations

- Changes in compiled functions are not seen across files. For example, if we have a cached Bodo function that calls a cached Bodo function in a different file, and modify the latter, Bodo will not update its cache (and therefore run with the old version of the function).
- Global variables are treated as compile-time constants. When a function is compiled, the values of any globals that the function uses are embedded in the binary at compilation time and remain constant. If the value of the global changes in the source code after compilation, the compiled object (and cache) will not rebind to the new value.

## Advanced Features

### Explicit Parallel Loops
Sometimes explicit parallel loops are required since a program cannot be written in terms of data-parallel operators easily. In this case, one can use Bodo’s `prange` in place of `range` to specify that a loop can be parallelized. The user is required to make sure the loop does not have cross-iteration dependencies except for supported reductions.

The example below demonstrates a parallel loop with a reduction:

In [18]:
import bodo
from bodo import prange
import numpy as np

@bodo.jit
def prange_test(n):
    A = np.random.ranf(n)
    s = 0
    B = np.empty(n)
    for i in prange(len(A)):
        bodo.parallel_print("rank", bodo.get_rank())
        # A[i]: distributed data access with loop index
        # s: a supported sum reduction
        s += A[i]
        # write array with loop index
        B[i] = 2 * A[i]
    return s + B.sum()

res = prange_test(10)
print(res)

rank 7
rank 6
rank 3
rank 5
rank 2
rank 1
rank 1
rank 0
rank 0
rank 4
15.960590511716582


Currently, reductions using the `+=`, `*=`, `min`, and `max` operators are supported. Iterations are divided between processes and executed in parallel, while reductions are handled using data exchange.
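
As a rough illustration of how such reductions can be evaluated in parallel, the sketch below reduces each chunk locally and then combines the partial results. It is plain Python, the chunking is arbitrary, and the `REDUCTIONS` table and `parallel_reduce` helper are hypothetical; it is not Bodo's implementation.

```python
import math
import operator
from functools import reduce

# (combine function, identity value) for each supported reduction
REDUCTIONS = {
    "+=": (operator.add, 0),
    "*=": (operator.mul, 1),
    "min": (min, math.inf),
    "max": (max, -math.inf),
}

def parallel_reduce(values, op, num_procs):
    """Split `values` into at most `num_procs` chunks, reduce each chunk
    locally, then combine the per-process partial results."""
    combine, identity = REDUCTIONS[op]
    step = max(1, -(-len(values) // num_procs))  # ceiling division
    chunks = [values[i:i + step] for i in range(0, len(values), step)]
    partials = [reduce(combine, chunk, identity) for chunk in chunks]
    # this final step is what requires data exchange between processes
    return reduce(combine, partials, identity)

data = [4, 2, 7, 1, 5, 3]
results = {op: parallel_reduce(data, op, 3) for op in REDUCTIONS}
print(results)
```

Each of these operators is associative and has an identity value, which is what allows the partial results to be combined in any grouping.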

### Collections of Distributed Data
List and dictionary collections can be used to hold distributed data structures:

In [23]:
@bodo.jit()
def f():
    to_concat = []
    for i in range(10):
        to_concat.append(pd.DataFrame({'A': np.arange(100), 'B': np.random.random(100)}))
    df = pd.concat(to_concat)
    return df

f()

     A         B
0    0  0.119124
..  ..       ...
99  99  0.713300


## Troubleshooting

### Compilation Tips

The general recommendation is to **compile the code that is performance critical and/or requires scaling**.

1. Don’t use Bodo for scripts that set up infrastructure or do initializations.
2. Only use Bodo for data processing and analytics code.

This reduces the risk of hitting unsupported features and reduces compilation time. To do so, simply factor out the code that needs to be compiled by Bodo and pass data into Bodo compiled functions.

### Compilation Errors

The most common cause of compilation errors is code that relies on features Bodo does not currently support, so it is important to understand the limitations of Bodo. There are 4 main limitations:

1. Unsupported pandas APIs (see [here](https://docs.bodo.ai/2022.6/api_docs/pandas/))
2. Unsupported NumPy APIs (see [here](https://docs.bodo.ai/2022.6/api_docs/numpy/))
3. Unsupported Python features or data types (see [here](https://docs.bodo.ai/2022.6/bodo_parallelism/not_supported/))
4. Python programs that cannot be compiled due to type instability

Solutions:

1. Make sure your code works in Python (using a small sample dataset): often, a Bodo-decorated function that fails to compile does not work in regular Python either.
2. Replace unsupported operations with supported operations if possible.
3. Refactor the code to partially use regular Python, explained in "Integration with non-Bodo APIs" section.

For example, the code below uses heterogeneous list values inside `a`, which cannot be typed:

In [24]:
@bodo.jit
def f(n):
    a = [[-1, "a"]]
    for i in range(n):
        a.append([i, "a"])
    return a

print(f(3))

BodoError:

However, this use case can be rewritten to use tuple values instead of lists since values don't change:

In [25]:
@bodo.jit
def f(n):
    a = [(-1, "a")]
    for i in range(n):
        a.append((i, "a"))
    return a

print(f(3))

[(-1, 'a'), (0, 'a'), (1, 'a'), (2, 'a')]
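
The difference between the two versions can be illustrated with a toy type checker. It assumes a simplified model in which a list needs a single element type while a tuple has a fixed type per position; `element_signature` is a hypothetical helper, not the inference algorithm Bodo actually uses.

```python
def element_signature(value):
    """Describe a value's type; tuples are described position by position
    because each position can have its own fixed type."""
    if isinstance(value, tuple):
        return tuple(element_signature(v) for v in value)
    if isinstance(value, list):
        # a typed list needs a single element type
        kinds = {element_signature(v) for v in value}
        if len(kinds) > 1:
            raise TypeError(f"heterogeneous list elements: {sorted(map(str, kinds))}")
        return ("list", kinds.pop() if kinds else None)
    return type(value).__name__

# list of (int, str) tuples: one consistent element type
print(element_signature([(-1, "a"), (0, "a")]))

# list of [int, str] lists: the inner lists mix two element types
try:
    element_signature([[-1, "a"], [0, "a"]])
except TypeError as exc:
    print("cannot type:", exc)
```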


### DataFrame Schema Stability

Deterministic dataframe schemas (column names and types), which are required in most data systems, are key for type stability. For example, variable `df` in the example below could be either a single-column dataframe or a two-column one, and Bodo cannot determine which at compilation time:

In [26]:
@bodo.jit
def f(a, n):
    df = pd.DataFrame({"A": np.arange(n)})
    df2 = pd.DataFrame({"A": np.arange(n) ** 2, "C": np.ones(n)})
    if len(a) > 3:
        df = df.merge(df2)

    return df.mean()

print(f([2, 3], 10))
# TypingError: Cannot unify dataframe((array(int64, 1d, C),), RangeIndexType(none), ('A',), False)
# and dataframe((array(int64, 1d, C), array(float64, 1d, C)), RangeIndexType(none), ('A', 'C'), False) for 'df'

TypingError: Cannot unify dataframe((Array(int64, 1, 'C', False, aligned=True),), RangeIndexType(none), ('A',), 1D_Block_Var, False, False) and dataframe((Array(int64, 1, 'C', False, aligned=True), Array(float64, 1, 'C', False, aligned=True)), RangeIndexType(none), ('A', 'C'), 1D_Block_Var, True, False) for 'df.2', defined at /var/folders/nb/s_7bnf052hg0lfqvbrw10bsw0000gn/T/ipykernel_67841/1453791955.py (8)

File "../../../../../../var/folders/nb/s_7bnf052hg0lfqvbrw10bsw0000gn/T/ipykernel_67841/1453791955.py", line 8:
<source missing, REPL/exec in use?>

The error message means that Bodo cannot find a single type that unifies the two dataframe types. This code can be refactored so that the `if` control flow is executed in regular Python, while the rest of the computation stays in Bodo functions. For example, one could use two versions of the function:

In [27]:
@bodo.jit
def f1(n):
    df = pd.DataFrame({"A": np.arange(n)})
    return df.mean()

@bodo.jit
def f2(n):
    df = pd.DataFrame({"A": np.arange(n)})
    df2 = pd.DataFrame({"A": np.arange(n) ** 2, "C": np.ones(n)})
    df = df.merge(df2)
    return df.mean()

a = [2, 3]
if len(a) > 3:
    print(f2(10))
else:
    print(f1(10))

A    4.5
dtype: float64


Another common place where schema stability may be compromised is in passing a non-constant list of key column names to dataframe operations such as `groupby`, `merge` and `sort_values`. In these operations, Bodo should be able to deduce the list of key column names at compile time in order to determine the output dataframe schema. For example, the program below is potentially type unstable since Bodo may not be able to infer `column_list` during compilation:

In [30]:
@bodo.jit
def f(a, n, flag):
    # some computation that cannot be inferred statically
    column_list = a
    if flag:
        column_list = ["A"]
    df = pd.DataFrame({"A": np.arange(n), "B": np.ones(n)})
    return df.groupby(column_list).sum()

a = ["A", "B"]
f(a, 10, True)
# BodoError: groupby(): 'by' parameter only supports a constant column label or column labels.

BodoError: groupby(): 'by' parameter only supports a constant column label or column labels, not reflected list(unicode_type)<iv=None>.

File "../../../../../../var/folders/nb/s_7bnf052hg0lfqvbrw10bsw0000gn/T/ipykernel_67841/3424421760.py", line 8:
<source missing, REPL/exec in use?>

The code can most often be refactored to compute the key list in regular Python and pass it as an argument to Bodo:

In [31]:
@bodo.jit
def f(column_list, n):
    df = pd.DataFrame({"A": np.arange(n), "B": np.ones(n)})
    return df.groupby(column_list).sum()

flag = True
column_list = ["A", "B"]
if flag:
    column_list = ["A"]
f(column_list, 10)

      B
A
9   1.0
..  ...
6   1.0


## Nullable Integers in Pandas

DataFrame and Series objects with integer data need special care due to [integer NA issues in Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#nan-integer-na-values-and-na-type-promotions). By default, Pandas dynamically converts integer columns to floating point when missing values (NAs) are needed, which can result in loss of precision as well as type instability.
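
The loss of precision is easy to reproduce in plain Python, since a Python `float` is a 64-bit floating point number like the `float64` type Pandas promotes to. The specific value below is chosen for illustration only.

```python
big = 2**53 + 1          # an integer that float64 cannot represent exactly
as_float = float(big)    # what happens when an integer is promoted to float

print(big)               # 9007199254740993
print(int(as_float))     # 9007199254740992
```

Any integer beyond 2**53 in magnitude may be altered in this way, which is why identifiers and other large integer values are better kept in an integer type.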

Pandas introduced [a new nullable integer data type](https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html#integer-na) that can solve this issue, which is also supported by Bodo. For example, this code reads column A into a nullable integer array (the capital “I” denotes nullable integer type):

In [32]:
data = (
    "11,1.2\n"
    "-2,\n"
    ",3.1\n"
    "4,-0.1\n"
)

with open("data/data.csv", "w") as f:
    f.write(data)


@bodo.jit()
def f():
    dtype = {"A": "Int64", "B": "float64"}
    df = pd.read_csv("data/data.csv", dtype = dtype, names = dtype.keys())
    return df

f()

     A    B
0   11  1.2
..  ..  ...
3    4 -0.1


## Checking NA Values

When an operation iterates over the values in a Series or Array, type stability requires special handling for NAs using `pd.isna()`. For example, `Series.map()` applies an operation to each element in the series, and failing to check for NAs can result in garbage values propagating.

In [33]:
S = pd.Series(pd.array([1, None, None, 3, 10], dtype="Int8"))

@bodo.jit
def map_copy(S):
    return S.map(lambda a: a if not pd.isna(a) else None)

print(map_copy(S))

0    1
    ..
4    1
Length: 5, dtype: int8[pyarrow]


### Boxing/Unboxing Overheads

Bodo uses efficient native data structures, which can differ from Python's. When Python values are passed to Bodo, they are *unboxed* to a native representation. Conversely, returning Bodo values requires *boxing* them into Python objects. Boxing and unboxing can have significant overhead depending on the size and type of the data. For example, repeatedly passing `date` columns between Python and Bodo can be expensive:

In [36]:
@bodo.jit()
def get_data():
    df = pd.read_parquet("data/cycling_dataset.pq")
    return df["time"].dt.date

@bodo.jit()
def get_day(S):
    return S.map(lambda d: d.day)

S = get_data()
print(S)
res = get_day(S)
print(res)

0       2016-10-20
           ...    
3901    2016-10-20
Name: time, Length: 3902, dtype: object
0       20
        ..
3901    20
Name: time, Length: 3902, dtype: int64


One can try to keep data in Bodo functions as much as possible to avoid boxing/unboxing overheads:

In [37]:
@bodo.jit()
def get_data():
    df = pd.read_parquet("data/cycling_dataset.pq")
    return df["time"].dt.date

@bodo.jit()
def get_day(S):
    return S.map(lambda d: d.day)

@bodo.jit
def f():
    S = get_data()
    print(S)
    res = get_day(S)
    print(res)

f()

0      2016-10-20
1      2016-10-20
2      2016-10-20
3      2016-10-20
4      2016-10-20
          ...    
483    2016-10-20
484    2016-10-20
485    2016-10-20
486    2016-10-20
487    2016-10-20
Name: time, Length: 488, dtype: object
1464    2016-10-20
1465    2016-10-20
1466    2016-10-20
1467    2016-10-20
1468    2016-10-20
           ...    
1947    2016-10-20
1948    2016-10-20
1949    2016-10-20
1950    2016-10-20
1951    2016-10-20
Name: time, Length: 488, dtype: object
2928    2016-10-20
2929    2016-10-20
2930    2016-10-20
2931    2016-10-20
2932    2016-10-20
           ...    
3410    2016-10-20
3411    2016-10-20
3412    2016-10-20
3413    2016-10-20
3414    2016-10-20
Name: time, Length: 487, dtype: object
3415    2016-10-20
3416    2016-10-20
3417    2016-10-20
3418    2016-10-20
3419    2016-10-20
           ...    
3897    2016-10-20
3898    2016-10-20
3899    2016-10-20
3900    2016-10-20
3901    2016-10-20
Name: time, Length: 487, dtype: object
2440    2016-10-20


### Iterating Over Columns

Iterating over columns in a dataframe can cause type stability issues, since column types in each iteration can be different. Bodo supports this usage for many practical cases by automatically unrolling loops over dataframe columns when possible. For instance, the example below computes the sum of all dataframe columns:

In [38]:
@bodo.jit
def f():
    n = 20
    df = pd.DataFrame({"A": np.arange(n), "B": np.arange(n) ** 2, "C": np.ones(n)})
    s = 0
    for c in df.columns:
        s += df[c].sum()
    return s

f()

2680.0

For automatic unrolling, the loop needs to be a `for` loop over column names that can be determined by Bodo at compile time.
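
Conceptually, unrolling replaces the loop with one statement per column, so each statement deals with a single, known column type. The sketch below shows the two equivalent forms in plain Python, with a dictionary of lists standing in for the dataframe; it illustrates the idea only and is not the code Bodo generates.

```python
# plain-Python stand-in for the dataframe: column name -> values
n = 20
columns = {
    "A": list(range(n)),                 # integers
    "B": [i ** 2 for i in range(n)],     # integers
    "C": [1.0] * n,                      # floats
}

# original form: one loop body must handle every column type
s_loop = 0
for c in columns:
    s_loop += sum(columns[c])

# unrolled form: one statement per column, each with a known type
s_unrolled = 0
s_unrolled += sum(columns["A"])
s_unrolled += sum(columns["B"])
s_unrolled += sum(columns["C"])

print(s_loop, s_unrolled)  # 2680.0 2680.0
```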

## Regular Expressions using `re`

Bodo supports string processing using Pandas and the `re` standard package, offering significant flexibility for string processing applications. For example:

In [39]:
import re

@bodo.jit
def f(S):
    def g(a):
        res = 0
        if re.search(".*AB.*", a):
            res = 3
        if re.search(".*23.*", a):
            res = 5
        return res

    return S.map(g)

S = pd.Series(["AABCDE", "BBABCE", "1234"])
f(S)

0    3
1    3
2    5
dtype: int64
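
As a side note, `re.search` scans the whole string, so the leading and trailing `.*` in the patterns above do not change which strings match. The sketch below shows an equivalent classifier in regular Python with precompiled patterns; the names `HAS_AB`, `HAS_23`, and `classify` are hypothetical, and whether precompiled patterns defined this way compile under `bodo.jit` is not covered here.

```python
import re

# "AB" matches the same strings as ".*AB.*" when used with search()
HAS_AB = re.compile("AB")
HAS_23 = re.compile("23")

def classify(text):
    res = 0
    if HAS_AB.search(text):
        res = 3
    if HAS_23.search(text):
        res = 5
    return res

labels = [classify(s) for s in ["AABCDE", "BBABCE", "1234"]]
print(labels)  # [3, 3, 5]
```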