# Generalizing the Pipeline

<div class="alert alert-block alert-info">
<b>Warning:</b>
This notebook depends on the Parquet files generated by the notebook <b>01 Preparing the Data</b>. Make sure to run all cells in that notebook before executing this one.
    
In particular, the files required are:
<ul>
    <li><tt>ParkingData/Parking_Violations_Issued_-_Fiscal_Year_2016_segmented.parquet</tt></li>
    <li><tt>ParkingData/Parking_Violations_Issued_-_Fiscal_Year_2017_segmented.parquet</tt></li>
</ul>
</div>

In [1]:
import numpy as np, pandas as pd
import time, bodo

## A First Attempt: specifying a toggle

There's one more important thing we've not yet considered: how to choose among different input files. Keep in mind that the Bodo JIT compiler must statically type objects for compilation. That is, `bodo.jit` will do everything it can to infer dtypes, but if there is a potential conflict, compilation fails rather than attempting to force compatibility.

We can see this in the next example.  We'll set up `load_parking_tickets_toggle` to use a boolean input parameter as a toggle value to use in an `if/else` branch within the function. Perhaps an earlier piece of our data pipeline supplies this toggle value and we want this function to vary its behavior depending on that input.

In [2]:
@bodo.jit(spawn=True)
def load_parking_tickets_toggle(toggle):
    """
    Load data from file and aggregate by day, violation type, and police precinct.
    """

    start = time.time()
    if toggle:
        year_2016_df = pd.read_parquet('ParkingData/Parking_Violations_Issued_-_Fiscal_Year_2016_segmented.parquet')
        year_2016_df = year_2016_df.groupby(['Issue Date','Violation County','Violation Precinct','Violation Code'], as_index=False)['Summons Number'].count()
        many_year_df = year_2016_df
    else:
        year_2017_df = pd.read_parquet('ParkingData/Parking_Violations_Issued_-_Fiscal_Year_2017_segmented.parquet')
        year_2017_df = year_2017_df.groupby(['Issue Date','Violation County','Violation Precinct','Violation Code'], as_index=False)['Summons Number'].count()
        many_year_df = year_2017_df

    end = time.time()
    timing_str = f"\n{'Reading Time:':<42}{end - start:8.3f} sec"
    return many_year_df, timing_str

When we attempt to invoke `load_parking_tickets_toggle`, an exception is raised. Most of the lengthy output has been deleted; only the final error message is kept.

<div class="alert alert-block alert-info">
<b>Warning:</b>
Executing the next cell is expected to yield errors. You will see a lengthy stack trace if you attempt to execute the notebook from top to bottom (e.g., using the <tt>Run All Cells</tt> option from the <tt>Run</tt> menu).
</div>

In [3]:
load_parking_tickets_toggle(False)

TypingError: Failed in bodo mode pipeline (step: <class 'bodo.transforms.typing_pass.BodoTypeInference'>)
Cannot unify dataframe((Array(datetime64[ns], 1, 'C', False, aligned=True), DictionaryArrayType(StringArrayType()), Array(float64, 1, 'C', False, aligned=True), Array(int64, 1, 'C', False, aligned=True), Array(int64, 1, 'C', False, aligned=True)), RangeIndexType(none), ('Issue Date', 'Violation County', 'Violation Precinct', 'Violation Code', 'Summons Number'), 1D_Block_Var, True, False) and dataframe((Array(datetime64[ns], 1, 'C', False, aligned=True), DictionaryArrayType(StringArrayType()), Array(int64, 1, 'C', False, aligned=True), Array(int64, 1, 'C', False, aligned=True), Array(int64, 1, 'C', False, aligned=True)), RangeIndexType(none), ('Issue Date', 'Violation County', 'Violation Precinct', 'Violation Code', 'Summons Number'), 1D_Block_Var, True, False) for 'many_year_df.2', defined at /var/folders/w_/z_0_fn150v36jdgzrrlcj8q00000gn/T/ipykernel_1515/3144861981.py (17)

File "../../../../../../var/folders/w_/z_0_fn150v36jdgzrrlcj8q00000gn/T/ipykernel_1515/3144861981.py", line 17:
<source missing, REPL/exec in use?>
During: typing of assignment at /var/folders/w_/z_0_fn150v36jdgzrrlcj8q00000gn/T/ipykernel_1515/3144861981.py (17)

File "../../../../../../var/folders/w_/z_0_fn150v36jdgzrrlcj8q00000gn/T/ipykernel_1515/3144861981.py", line 17:
<source missing, REPL/exec in use?>

The important part of the lengthy stack trace produced looks like this:

<div class="alert alert-block alert-info">
<tt>TypingError: Cannot unify dataframe((array(datetime64[ns], 1d, C), StringArrayType(), array(int64, 1d, C), array(int64, 1d, C), array(int64, 1d, C)), RangeIndexType(none), ('Issue Date', 'Violation County', 'Violation Precinct', 'Violation Code', 'Summons Number'), 1D_Block_Var, False) and dataframe((array(datetime64[ns], 1d, C), StringArrayType(), array(float64, 1d, C), array(int64, 1d, C), array(int64, 1d, C)), RangeIndexType(none), ('Issue Date', 'Violation County', 'Violation Precinct', 'Violation Code', 'Summons Number'), 1D_Block_Var, False) for 'many_year_df.2'</tt>
</div>

The problem is that the `'Violation Precinct'` column is inferred as dtype `float64` for the 2016 dataset (as the `bodo.typeof` output at the end of this notebook confirms) and as dtype `int64` for the 2017 dataset. The `float64` dtype is likely due to missing entries being represented as `NaN`. Because either `year_2016_df` or `year_2017_df` is assigned to `many_year_df`, the function is not *type-stable*, i.e., the compiler cannot settle on a single schema for `many_year_df`.
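
To make the unification failure concrete, here is a toy model in plain Python. It only illustrates the idea and is not how Bodo's type inference is implemented: the `unify` helper and the simplified schema tuples are hypothetical, and the dtype names are shortened versions of the two column-type lists quoted in the error message above.

```python
# Toy model of type unification; NOT Bodo's actual implementation.
COLUMNS = (
    "Issue Date",
    "Violation County",
    "Violation Precinct",
    "Violation Code",
    "Summons Number",
)

# Simplified column dtypes of the two dataframe types named in the error.
schema_a = ("datetime64[ns]", "str", "int64", "int64", "int64")
schema_b = ("datetime64[ns]", "str", "float64", "int64", "int64")


def unify(left, right):
    """Return the common schema, or raise if any column dtype differs."""
    conflicts = [
        (name, lt, rt)
        for name, lt, rt in zip(COLUMNS, left, right)
        if lt != rt
    ]
    if conflicts:
        details = ", ".join(f"{name!r}: {lt} vs {rt}" for name, lt, rt in conflicts)
        raise TypeError(f"Cannot unify schemas ({details})")
    return left


# Both branches producing the same schema is fine.
assert unify(schema_a, schema_a) == schema_a

# Branches producing different schemas fail, as in the compiled function.
try:
    unify(schema_a, schema_b)
except TypeError as err:
    message = str(err)
    print(message)
```

A single differing column is enough to make the whole assignment fail, which is why the two branches of `load_parking_tickets_toggle` cannot share the variable `many_year_df`.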

## A Second Approach: specifying a file path

We can, if absolutely necessary, explicitly specify dtypes for the specific files; the process is described [here](https://docs.bodo.ai/2022.3/file_io/#non-constant-filepaths). A far easier path is to remove the ambiguity in the first place: pass the file path as an argument and let the compiler handle each request on a case-by-case basis. This is the approach we settle on here in `load_parking_tickets_file`.

In [4]:
@bodo.jit(spawn=True)
def load_parking_tickets_file(file):
    """
    Load data from specified file and aggregate by day, violation type, and police precinct.
    """
    start = time.time()
    year_df = pd.read_parquet(file)
    groupby_cols = ['Issue Date','Violation County','Violation Precinct','Violation Code']
    year_df = year_df.groupby(groupby_cols, as_index=False)['Summons Number'].count()

    end = time.time()
    timing_str = f"\n{'Reading Time:':<42}{end - start:8.3f} sec"
    return year_df, timing_str

This executes smoothly on the data from 2016.

In [5]:
# Try executing again with the 2016 file path as input
DATA_SRC = 'ParkingData/Parking_Violations_Issued_-_Fiscal_Year_2016_segmented.parquet'
df, msg = load_parking_tickets_file(DATA_SRC)
print(df.shape)
print(msg)

(633247, 5)

Reading Time:                                0.368 sec




We can try again with another input file, say the data from 2017.

In [6]:
# Try executing again with the 2017 file path as input
DATA_SRC = 'ParkingData/Parking_Violations_Issued_-_Fiscal_Year_2017_segmented.parquet'
df, msg = load_parking_tickets_file(DATA_SRC)
print(df.shape)
print(msg)

(624388, 5)

Reading Time:                                0.506 sec


We could even tweak the function to accept multiple arguments if we need to concatenate several files. Bodo also supports passing a list of Parquet files as an argument, or reading an entire folder of CSV or Parquet files.

Or perhaps, in keeping with the logic of the toggled function earlier, we can execute the toggling logic outside of a `bodo.jit`-compiled function and let the compiler handle specific cases without ambiguity.

In [7]:
## notice no jit decorator here
def run_load_by_toggle(toggle):
    if toggle:
        DATA_SRC = 'ParkingData/Parking_Violations_Issued_-_Fiscal_Year_2016_segmented.parquet'
    else:
        DATA_SRC = 'ParkingData/Parking_Violations_Issued_-_Fiscal_Year_2017_segmented.parquet'
    return load_parking_tickets_file(DATA_SRC)

In [8]:
result, msg = run_load_by_toggle(True)
display(result.head())
print(msg)

| | Issue Date | Violation County | Violation Precinct | Violation Code | Summons Number |
|---|---|---|---|---|---|
| 0 | 2015-07-09 | K | 74.0 | 46 | 1 |
| 1 | 2015-07-09 | K | 88.0 | 71 | 12 |
| 2 | 2015-07-09 | K | 94.0 | 71 | 14 |
| 3 | 2015-07-09 | K | 84.0 | 74 | 3 |
| 4 | 2015-07-09 | K | 88.0 | 20 | 15 |



Reading Time:                                0.242 sec
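
The two-way toggle in `run_load_by_toggle` extends naturally to a lookup by year. As a sketch, the hypothetical helper `parquet_path_for_year` below builds the file name in ordinary Python, following the naming pattern of the two files listed at the top of this notebook; the resulting string could then be passed to `load_parking_tickets_file`. The set of available years is an assumption based on those two files.

```python
# Hypothetical helper; runs in regular Python, outside any bodo.jit function.
AVAILABLE_YEARS = (2016, 2017)  # assumption: only these files were generated
PATH_TEMPLATE = (
    "ParkingData/Parking_Violations_Issued_-_Fiscal_Year_{year}_segmented.parquet"
)


def parquet_path_for_year(year):
    """Return the Parquet path for a fiscal year, rejecting unknown years."""
    if year not in AVAILABLE_YEARS:
        raise ValueError(f"No Parquet file prepared for fiscal year {year}")
    return PATH_TEMPLATE.format(year=year)


path = parquet_path_for_year(2017)
print(path)
```

Keeping the selection logic in plain Python means the compiled function only ever receives one path per call, so no single variable inside it has to hold two different schemas.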


We now have most of what we need to build a flexible pipeline. Should we need it, Bodo also has the useful built-in `bodo.typeof` that can be used to determine what types are being inferred by the compiler, so we don't get into unexpected situations.

In [9]:
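# load_parking_tickets_file returns a (DataFrame, timing string) tuple, so typeof reports a Tuple type
my_temp_df = load_parking_tickets_file('ParkingData/Parking_Violations_Issued_-_Fiscal_Year_2016_segmented.parquet')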
print(bodo.typeof(my_temp_df))

Tuple(dataframe((Array(datetime64[ns], 1, 'C', False, aligned=True), DictionaryArrayType(StringArrayType()), Array(float64, 1, 'C', False, aligned=True), Array(int64, 1, 'C', False, aligned=True), Array(int64, 1, 'C', False, aligned=True)), RangeIndexType(none), ('Issue Date', 'Violation County', 'Violation Precinct', 'Violation Code', 'Summons Number'), REP, True, False), unicode_type)
