# Constructing a Pipeline of Functions

<div class="alert alert-block alert-info">
<b>Warning:</b>
This notebook depends on the Parquet files generated by the notebook <b>01 Preparing the Data</b>. Make sure to run all cells in that notebook before executing this one.
    
In particular, the files required are:
<ul>
    <li><tt>ParkingData/Parking_Violations_Issued_-_Fiscal_Year_2016_segmented.parquet</tt></li>
    <li><tt>ParkingData/Parking_Violations_Issued_-_Fiscal_Year_2017_segmented.parquet</tt></li>
</ul>
</div>

This notebook follows the analysis set up in the notebooks **01 Preparing the Data** and **02 Comparing Data Sources**. The pipeline we lay out here was originally developed in a notebook from the [Bodo-examples](https://github.com/Bodo-inc/Bodo-examples) GitHub repository. [That notebook](https://github.com/Bodo-inc/Bodo-examples/blob/master/notebooks/parking/nyc-parking-tickets.ipynb) provides deeper discussion of the individual functions.

Our purpose here is to emphasize usage patterns and how they affect run times. For now, it is enough to know that we take a function similar to `load_parking_tickets` from the last notebook and then execute a sequence of transformations on the resulting DataFrame.

This notebook follows a workflow typical of data exploration in Jupyter: it defines functions and then executes them to compute results. The results of intermediate computations are returned from function calls as objects in the Python interpreter namespace, where they can be inspected before continuing to the next portion of the analysis.

In [1]:
import pandas as pd, numpy as np
import bodo, time

## Setting up the Pipeline

The following sequence of cells defines functions to use in our pipeline.

+ Notice that the functions all return a string with timing information as well as one or more DataFrames. This is strictly for convenient diagnostics.
+ Remember, the code in all these cells executes on all engines by virtue of the `%%px` cell magic.
+ The cell immediately following each function definition checks that the function works as intended. This is not rigorous testing, but it is a typical development workflow in building data pipelines.
+ The definition of the function `load_violation_precincts_codes` is decorated using `@bodo.jit(distributed=False)`. Using the option `distributed=False` means that these small DataFrames are *replicated* onto all engines. For small data like this that is needed on all engines, replicating the data is not punitive and is in fact useful.
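
The timing-string convention described in the list above can be tried without Bodo. The sketch below is plain Python; `timed_step` and `double_all` are hypothetical names used only for illustration, and the format spec mirrors the one used in the cells that follow.

```python
import time

def timed_step(label, func, *args):
    """Run func(*args) and return (result, timing_str).

    timing_str pads the label to 40 characters and formats the elapsed
    seconds in an 8-character field with 3 decimals.
    """
    start = time.time()
    result = func(*args)
    end = time.time()
    timing_str = f"\n{label:<40}{end - start:8.3f} sec"
    return result, timing_str

def double_all(values):
    # Stand-in for a real pipeline stage (hypothetical).
    return [2 * v for v in values]

doubled, report = timed_step("Doubling time:", double_all, [1, 2, 3])
print(doubled)
print(report)
```

Returning the timing string alongside the data keeps each stage self-reporting, at the price of a slightly noisier return signature.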

In [2]:
@bodo.jit
def load_parking_tickets():
    """
    Load data and aggregate by day, violation type, and police precinct.
    """
    start = time.time()
    groupby_cols = ['Issue Date','Violation County','Violation Precinct','Violation Code']
    
    DATA_SRC = 'ParkingData/Parking_Violations_Issued_-_Fiscal_Year_2016_segmented.parquet'
    year_2016_df = pd.read_parquet(DATA_SRC)
    year_2016_df = year_2016_df.groupby(groupby_cols, as_index=False)['Summons Number'].count()

    DATA_SRC = 'ParkingData/Parking_Violations_Issued_-_Fiscal_Year_2017_segmented.parquet'
    year_2017_df = pd.read_parquet(DATA_SRC)
    year_2017_df = year_2017_df.groupby(groupby_cols, as_index=False)['Summons Number'].count()
    
    # concatenate all dataframes into one dataframe
    many_year_df = pd.concat([year_2016_df, year_2017_df])
    end = time.time()
    timing_str = f"\n{'Reading Time:':<40}{end - start:8.3f} sec"
    return many_year_df, timing_str

In [3]:
# Verify that load_parking_tickets works as intended
main_df, output = load_parking_tickets()

# Examine output from load_parking_tickets to ensure it makes sense
display(main_df.head())
print(output)



| | Issue Date | Violation County | Violation Precinct | Violation Code | Summons Number |
|---|---|---|---|---|---|
| 0 | 2015-07-09 | K | 74.0 | 46 | 1 |
| 1 | 2015-07-09 | K | 88.0 | 71 | 12 |
| 2 | 2015-07-09 | K | 94.0 | 71 | 14 |
| 3 | 2015-07-09 | K | 84.0 | 74 | 3 |
| 4 | 2015-07-09 | K | 88.0 | 20 | 15 |



Reading Time:                              1.164 sec


In [6]:
@bodo.jit(distributed=False)
def load_violation_precincts_codes():
    """
    Load violation codes and precincts information.
    """
    start = time.time()
    violation_codes = pd.read_csv("ParkingData/DOF_Parking_Violation_Codes.csv")
    violation_codes.columns = ['Violation Code','Definition','manhattan_96_and_below','all_other_areas']
    nyc_precincts_df = pd.read_csv("ParkingData/nyc_precincts.csv", index_col='index')
    end = time.time()
    timing_str = f"\n{'Violation and precincts load Time:':<40}{end - start:8.3f} sec"
    return violation_codes, nyc_precincts_df, timing_str

In [8]:
# Verify that load_violation_precincts_codes works as intended
violation_codes, nyc_precincts_df, output = load_violation_precincts_codes()
display(violation_codes.head())
display(nyc_precincts_df.head())
print(output)



| | Violation Code | Definition | manhattan_96_and_below | all_other_areas |
|---|---|---|---|---|
| 0 | 10 | Stopping, standing or parking where a sign, st... | 115 | 115 |
| 1 | 11 | Hotel Loading/Unloading: Standing or parking w... | 115 | 115 |
| 2 | 12 | Snow Emergency: Standing or parking where stan... | 95 | 95 |
| 3 | 13 | Taxi Stand: Standing or parking where standing... | 115 | 115 |
| 4 | 14 | General No Standing: Standing or parking where... | 115 | 115 |


| index | Violation Precinct |
|---|---|
| 0 | 1 |
| 1 | 5 |
| 2 | 6 |
| 3 | 7 |
| 4 | 9 |



Violation and precincts load Time:         0.937 sec


In [9]:
@bodo.jit
def elim_code_36(main_df):
    """
    Remove undefined violations (code 36)
    """
    start = time.time()
    main_df = main_df[main_df['Violation Code']!=36].sort_values('Summons Number',ascending=False)
    end = time.time()
    timing_str = f"\n{'Eliminate undefined violations time:':<40}{end - start:8.3f} sec"
    return main_df, timing_str

In [10]:
# Verify that elim_code_36 works as intended
main_df, output = elim_code_36(main_df)
# Examine output from elim_code_36 to ensure it makes sense
display(main_df.head())
print(output)



| | Issue Date | Violation County | Violation Precinct | Violation Code | Summons Number |
|---|---|---|---|---|---|
| 255375 | 2015-07-11 | | 0.0 | 7 | 2365 |
| 508205 | 2015-06-13 | | 0.0 | 7 | 2327 |
| 261759 | 2015-07-25 | | 0.0 | 7 | 2285 |
| 65383 | 2015-06-24 | | 0.0 | 7 | 2204 |
| 191978 | 2015-06-14 | | 0.0 | 7 | 2182 |



Eliminate undefined violations time:       0.951 sec


In [11]:
@bodo.jit
def remove_outliers(main_df):
    """
    Delete entries that have dates outside our dataset dates
    """
    start = time.time()
    main_df = main_df[(main_df['Issue Date'] >= '2016-01-01') & (main_df['Issue Date'] <= '2017-12-31')]
    end = time.time()
    timing_str = f"\n{'Remove outliers time:':<40}{end - start:8.3f} sec"
    return main_df, timing_str

In [12]:
# Verify that remove_outliers works as intended
main_df, output = remove_outliers(main_df)
# Examine output from remove_outliers to ensure it makes sense
display(main_df.head())
print(output)



| | Issue Date | Violation County | Violation Precinct | Violation Code | Summons Number |
|---|---|---|---|---|---|
| 492823 | 2016-04-17 | | 0.0 | 7 | 1661 |
| 49830 | 2016-04-16 | | 0.0 | 7 | 1603 |
| 619399 | 2016-04-10 | | 0.0 | 7 | 1589 |
| 117898 | 2016-04-23 | | 0.0 | 7 | 1586 |
| 613649 | 2016-03-12 | | 0.0 | 7 | 1545 |



Remove outliers time:                      0.010 sec


In [30]:
@bodo.jit(replicated=["violation_codes"])
def merge_violation_code(main_df, violation_codes):
    """
    Merge violation information in the main_df
    """
    start = time.time()
    # left join main_df and violation_codes df so that there's more info on violation in main_df
    main_df = pd.merge(main_df, violation_codes, on='Violation Code', how='left')
    # cast precincts as integers from floats (inadvertent type change by merge)
    main_df['Violation Precinct'] = main_df['Violation Precinct'].astype(int)
    end = time.time()
    timing_str = f"\n{'Merge time:':<40}{end - start:8.3f} sec"
    return main_df, timing_str

In [25]:
# Verify that merge_violation_code works as intended
main_w_violation, output = merge_violation_code(main_df, violation_codes)
# Examine output from merge_violation_code to ensure it makes sense
display(main_w_violation.head())
print(output)

| | Issue Date | Violation County | Violation Precinct | Violation Code | Summons Number | Definition | manhattan_96_and_below | all_other_areas |
|---|---|---|---|---|---|---|---|---|
| 0 | 2016-04-17 | | 0 | 7 | 1661 | Vehicles photographed going through a red ligh... | 50 | 50 |
| 1 | 2016-04-16 | | 0 | 7 | 1603 | Vehicles photographed going through a red ligh... | 50 | 50 |
| 2 | 2016-04-10 | | 0 | 7 | 1589 | Vehicles photographed going through a red ligh... | 50 | 50 |
| 3 | 2016-04-23 | | 0 | 7 | 1586 | Vehicles photographed going through a red ligh... | 50 | 50 |
| 4 | 2016-03-12 | | 0 | 7 | 1545 | Vehicles photographed going through a red ligh... | 50 | 50 |



Merge time:                                0.217 sec


In [26]:
@bodo.jit
def calculate_total_summons(main_df):
    """
    Calculate the total summonses in dollars for a violation in a precinct on a day
    """
    start = time.time()
    # create column for portion of precinct 96th st. and below
    n = len(main_df)
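    # NOTE: with an integer array, the fractional values assigned below (0.75 and 0.5)
    # are truncated to 0; an np.float64 array would preserve them.
    portion_manhattan_96_and_below = np.empty(n, np.int64)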
    # NOTE: To run Pandas, use this loop.
    # for i in range(n):
    for i in bodo.prange(n):
        x = main_df['Violation Precinct'].iat[i]
        if x < 22 or x == 23:
            portion_manhattan_96_and_below[i] = 1.0
        elif x == 22:
            portion_manhattan_96_and_below[i] = 0.75
        elif x == 24:
            portion_manhattan_96_and_below[i] = 0.5
        else: #other
            portion_manhattan_96_and_below[i] = 0
    main_df["portion_manhattan_96_and_below"] = portion_manhattan_96_and_below

    # create column for average dollar amount of summons based on location
    main_df['average_summons_amount'] = (main_df['portion_manhattan_96_and_below'] * main_df['manhattan_96_and_below']
                                     + (1 - main_df['portion_manhattan_96_and_below']) * main_df['all_other_areas'])

    # get total summons dollars by multiplying average dollar amount by number of summons given
    main_df['total_summons_dollars'] = main_df['Summons Number'] * main_df['average_summons_amount']
    main_df = main_df.sort_values(by=['total_summons_dollars'], ascending=False)
    end = time.time()
    timing_str = f"\n{'Calculate Total Summons Time:':<40}{end - start:8.3f} sec"
    return main_df, timing_str

In [16]:
# Verify that calculate_total_summons works as intended
total_summons, output = calculate_total_summons(main_w_violation)
# Examine output from calculate_total_summons to ensure it makes sense
display(total_summons.head())
print(output)



| | Issue Date | Violation County | Violation Precinct | Violation Code | Summons Number | Definition | manhattan_96_and_below | all_other_areas | portion_manhattan_96_and_below | average_summons_amount | total_summons_dollars |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-04-17 | | 0 | 7 | 1661 | Vehicles photographed going through a red ligh... | 50 | 50 | 1 | 50 | 83050 |
| 1 | 2016-04-16 | | 0 | 7 | 1603 | Vehicles photographed going through a red ligh... | 50 | 50 | 1 | 50 | 80150 |
| 2 | 2016-04-10 | | 0 | 7 | 1589 | Vehicles photographed going through a red ligh... | 50 | 50 | 1 | 50 | 79450 |
| 3 | 2016-04-23 | | 0 | 7 | 1586 | Vehicles photographed going through a red ligh... | 50 | 50 | 1 | 50 | 79300 |
| 4 | 2016-03-12 | | 0 | 7 | 1545 | Vehicles photographed going through a red ligh... | 50 | 50 | 1 | 50 | 77250 |



Calculate Total Summons Time:              1.301 sec
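
As a cross-check on the arithmetic in `calculate_total_summons`, here is a plain pandas/NumPy sketch on a tiny hand-made DataFrame. The rows and fine amounts are invented for illustration and are not taken from the dataset. The sketch uses `np.select` in place of the `bodo.prange` loop and stores the portions as floats, which is an assumption about the intended behavior rather than a copy of the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: precincts chosen to hit every branch of the loop.
df = pd.DataFrame({
    "Violation Precinct": [1, 22, 23, 24, 40],
    "Summons Number": [10, 4, 2, 2, 5],
    "manhattan_96_and_below": [115, 115, 115, 115, 115],
    "all_other_areas": [95, 95, 95, 95, 95],
})

precinct = df["Violation Precinct"].to_numpy()
# Same branch logic as the loop; the first matching condition wins.
df["portion_manhattan_96_and_below"] = np.select(
    [(precinct < 22) | (precinct == 23), precinct == 22, precinct == 24],
    [1.0, 0.75, 0.5],
    default=0.0,
)

# Weighted average of the two fine schedules, then total dollars per row.
p = df["portion_manhattan_96_and_below"]
df["average_summons_amount"] = (
    p * df["manhattan_96_and_below"] + (1 - p) * df["all_other_areas"]
)
df["total_summons_dollars"] = df["Summons Number"] * df["average_summons_amount"]

print(df[["Violation Precinct", "average_summons_amount", "total_summons_dollars"]])
```

For precinct 22 in this toy data, the average works out to 0.75 × 115 + 0.25 × 95 = 110.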


In [17]:
@bodo.jit
def aggregate(main_df):
    """
    Aggregate and filter data, e.g. total violations by precinct
    """
    start = time.time()
    filtered_dataset = main_df[['Violation Precinct','Summons Number', 'total_summons_dollars']]
    precinct_offenses_df = filtered_dataset.groupby(by=['Violation Precinct']).sum().reset_index().fillna(0)
    end = time.time()
    timing_str = f"\n{'Aggregate code time:':<40}{end - start:8.3f} sec"
    return precinct_offenses_df, timing_str

In [18]:
# Verify that aggregate works as intended
precinct_offenses_df, output = aggregate(total_summons)
# Examine output from aggregate to ensure it makes sense
display(precinct_offenses_df.head())
print(output)



| | Violation Precinct | Summons Number | total_summons_dollars |
|---|---|---|---|
| 0 | 34 | 150532 | 12667445 |
| 1 | 43 | 165959 | 11503985 |
| 2 | 120 | 65635 | 4608645 |
| 3 | 163 | 34 | 3010 |
| 4 | 11 | 88 | 8640 |



Aggregate code time:                       1.043 sec


---------------------

## Wrapping the Pipeline in a Function

We could keep this notebook as a template computation. Whenever a new dataset of the same form is to be analyzed, we can modify the filename hard-coded into the function `load_parking_tickets` and execute this specific sequence of cells again.
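
One way to avoid editing hard-coded filenames is to pass the paths in as an argument. The following is a plain-pandas sketch, not the notebook's Bodo function: `load_and_aggregate`, its `reader` parameter, `fake_reader`, and the file names are all hypothetical, and whether the same signature compiles under `bodo.jit` is not verified here.

```python
import pandas as pd

GROUPBY_COLS = ["Issue Date", "Violation County", "Violation Precinct", "Violation Code"]

def load_and_aggregate(paths, reader=pd.read_parquet):
    """Read each path, count summonses per group, and concatenate the results.

    `reader` is injectable so the logic can be exercised without files on disk.
    """
    frames = []
    for path in paths:
        df = reader(path)
        counts = df.groupby(GROUPBY_COLS, as_index=False)["Summons Number"].count()
        frames.append(counts)
    return pd.concat(frames)

def fake_reader(path):
    # In-memory stand-in for the Parquet reader (invented rows).
    return pd.DataFrame({
        "Issue Date": ["2016-01-01", "2016-01-01", "2016-01-02"],
        "Violation County": ["K", "K", "Q"],
        "Violation Precinct": [74, 74, 110],
        "Violation Code": [46, 46, 21],
        "Summons Number": [1001, 1002, 1003],
    })

combined = load_and_aggregate(["year_a.parquet", "year_b.parquet"], reader=fake_reader)
print(combined)
```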

The cell below puts the logic from the preceding sequence of cells into a single function `run_pipeline`.

In [27]:
# Put all the functions in the pipeline into a single function
def run_pipeline():
    start = time.time()
    
    main_df, out1 = load_parking_tickets()
    violation_codes, nyc_precincts_df, out2 = load_violation_precincts_codes()
    main_df, out3 = elim_code_36(main_df)
    main_df, out4 = remove_outliers(main_df)
    main_w_violation, out5 = merge_violation_code(main_df, violation_codes)
    total_summons, out6 = calculate_total_summons(main_w_violation)
    precinct_offenses_df, out7 = aggregate(total_summons)
    
    end = time.time()
    out8 = f"\n{52*'='}\n{'Execution time (run_pipeline):':<40}{end-start:8.3f} sec"
    output_str = ''.join(['\n', out1, out2, out3, out4, out5, out6, out7, out8, '\n'])
    return precinct_offenses_df, output_str

We execute the `run_pipeline` function here, using the `%%time` cell magic to report the elapsed time for executing all lines in the cell.

In [20]:
%%time
# Verify that run_pipeline works as intended
result, output = run_pipeline()
display(result.head())
print(result.shape)
print(output)

| | Violation Precinct | Summons Number | total_summons_dollars |
|---|---|---|---|
| 0 | 34 | 150532 | 12667445 |
| 1 | 43 | 165959 | 11503985 |
| 2 | 120 | 65635 | 4608645 |
| 3 | 163 | 34 | 3010 |
| 4 | 11 | 88 | 8640 |


(283, 3)


Reading Time:                              1.492 sec
Violation and precincts load Time:         0.631 sec
Eliminate undefined violations time:       0.111 sec
Remove outliers time:                      0.004 sec
Merge time:                                0.379 sec
Calculate Total Summons Time:              0.638 sec
Aggregate code time:                       0.328 sec
Execution time (run_pipeline):            12.302 sec

CPU times: user 17 ms, sys: 19.3 ms, total: 36.3 ms
Wall time: 12.3 s


A subtlety to observe is that the timings inside the function definitions measure only run-time execution. That is, whatever overhead is involved in *compiling* the functions is not accounted for in these performance measures. In this case, the observed discrepancy between the `Wall time` (as returned by `%%time`) and the `Execution time` (as returned by `run_pipeline`) is about 3 seconds (YMMV). As we've already *executed* the functions in the pipeline, those compiled results have been cached. So the overhead here is the time required to compile the function `run_pipeline`.
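
The distinction between a one-time compilation cost and a per-call run-time cost can be seen with a small stand-in that needs no Bodo. In this sketch, `slow_first_call` is a hypothetical function that simulates a one-time setup cost with `time.sleep`; it is an analogy for JIT compilation, not a measurement of it.

```python
import time

_cache = {}

def slow_first_call(n):
    """Simulate a function that pays a one-time setup cost on first use."""
    if "ready" not in _cache:
        time.sleep(0.2)          # stands in for compilation (paid once)
        _cache["ready"] = True
    return sum(range(n))         # stands in for the actual run-time work

def wall_time(func, *args):
    """Return (result, elapsed seconds) measured from outside the function."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

first_result, first_elapsed = wall_time(slow_first_call, 1000)
second_result, second_elapsed = wall_time(slow_first_call, 1000)
print(f"first call:  {first_elapsed:.3f} sec")
print(f"second call: {second_elapsed:.3f} sec")
```

A timer placed inside the function body, after the setup step, would report roughly the same value for both calls; only the outside measurement sees the one-time cost.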

If we re-execute the cell (copied below), this compilation overhead effectively vanishes.

In [21]:
%%time
# Verify that run_pipeline works as intended.
# Executing this a second time requires no compilation.
# The discrepancy between "Wall time" and "Execution time" should now be small.
result, output = run_pipeline()
display(result.head())
print(result.shape)
print(output)

| | Violation Precinct | Summons Number | total_summons_dollars |
|---|---|---|---|
| 0 | 34 | 150532 | 12667445 |
| 1 | 43 | 165959 | 11503985 |
| 2 | 120 | 65635 | 4608645 |
| 3 | 163 | 34 | 3010 |
| 4 | 11 | 88 | 8640 |


(283, 3)


Reading Time:                              1.564 sec
Violation and precincts load Time:         0.067 sec
Eliminate undefined violations time:       0.043 sec
Remove outliers time:                      0.004 sec
Merge time:                                0.076 sec
Calculate Total Summons Time:              0.315 sec
Aggregate code time:                       0.215 sec
Execution time (run_pipeline):            12.679 sec

CPU times: user 16.8 ms, sys: 17.5 ms, total: 34.3 ms
Wall time: 12.7 s


There is still a minor difference between `Wall time` and `Execution time` (a few hundred milliseconds). This reflects inevitable communication & set-up overhead, which is still much less costly than compilation.

## Boxing & Unboxing

Another potential issue with the pipeline developed above is that data passes in and out of Bodo-compiled functions in several places. Bodo and native Python do not always represent data structures in the same way: Bodo represents data using efficient native data structures. This means that passing an object from the top-level Python namespace into a Bodo-jitted function involves *'unboxing'* that object from its native Python representation. Similarly, returning an object from a Bodo-jitted function to the top-level Python namespace requires *'boxing'* it back into Python's native object representation. These transformations can be costly.
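
A rough analogy for this boundary cost, using only NumPy: in the sketch below, Python lists play the role of "boxed" objects and NumPy arrays play the role of native data. The function names are hypothetical, and the code neither uses nor measures Bodo; it only shows why converting at every stage boundary does more work than converting once.

```python
import numpy as np

def scale_boxed(values, factor):
    """Stage that accepts and returns a Python list (converts at both ends)."""
    arr = np.asarray(values, dtype=np.float64)   # "unbox" on entry
    return (arr * factor).tolist()               # "box" on exit

def shift_boxed(values, offset):
    """Second stage with the same list-in, list-out contract."""
    arr = np.asarray(values, dtype=np.float64)
    return (arr + offset).tolist()

def pipeline_boxed(values):
    # Two stages, so the data is converted list -> array -> list twice.
    return shift_boxed(scale_boxed(values, 2.0), 1.0)

def pipeline_fused(values):
    # Convert once on entry and once on exit; intermediate data stays native.
    arr = np.asarray(values, dtype=np.float64)
    arr = arr * 2.0
    arr = arr + 1.0
    return arr.tolist()

data = [0.0, 1.0, 2.0]
print(pipeline_boxed(data))
print(pipeline_fused(data))
```

Both pipelines return the same values; they differ only in how many conversions happen along the way.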

To illustrate this point, let's apply the `bodo.jit` decorator to the function `run_pipeline` from before. The content is identical except for the application of the JIT compiler (and some tailoring of the output).

In [31]:
# Put all the functions in the pipeline into a single jitted function
@bodo.jit
def run_pipeline_jitted():
    start = time.time()
    
    main_df, out1 = load_parking_tickets()
    violation_codes, nyc_precincts_df, out2 = load_violation_precincts_codes()
    main_df, out3 = elim_code_36(main_df)
    main_df, out4 = remove_outliers(main_df)
    main_w_violation, out5 = merge_violation_code(main_df, violation_codes)
    total_summons, out6 = calculate_total_summons(main_w_violation)
    precinct_offenses_df, out7 = aggregate(total_summons)
    
    end = time.time()
    out8 = f"\n{52*'='}\n{'Execution time (run_pipeline_jitted):':<40}{end-start:8.3f} sec"
    output_str = ''.join(['\n', out1, out2, out3, out4, out5, out6, out7, out8, '\n'])
    return precinct_offenses_df, output_str

Let's execute `run_pipeline_jitted` to see if it works. Remember, the first time this executes, the compilation time is reflected in the discrepancy between the reported `Execution time` and the `Wall time` returned by `%%time`.

In [32]:
%%time
# Verify that run_pipeline_jitted works as intended
result, output = run_pipeline_jitted()
display(result.head())
print(result.shape)
print(output)



| | Violation Precinct | Summons Number | total_summons_dollars |
|---|---|---|---|
| 0 | 34 | 150532 | 12667445 |
| 1 | 43 | 165959 | 11503985 |
| 2 | 120 | 65635 | 4608645 |
| 3 | 163 | 34 | 3010 |
| 4 | 11 | 88 | 8640 |


(283, 3)


Reading Time:                              3.429 sec
Violation and precincts load Time:         0.019 sec
Eliminate undefined violations time:       0.045 sec
Remove outliers time:                      0.005 sec
Merge time:                                0.030 sec
Calculate Total Summons Time:              0.125 sec
Aggregate code time:                       0.030 sec
Execution time (run_pipeline_jitted):      3.685 sec

CPU times: user 6.71 ms, sys: 6.6 ms, total: 13.3 ms
Wall time: 16.4 s


The difference in the observed `Execution time` results is noticeable (about 12 seconds for `run_pipeline` vs. about 4 seconds for `run_pipeline_jitted`). Because the functions called inside `run_pipeline_jitted` have also been passed through `bodo.jit`, data stays in Bodo's representation between stages and the penalty for boxing & unboxing Python data structures is removed. Notice that boxing & unboxing is a *run-time* cost: in `run_pipeline` it will persist in repeated executions (unlike the compilation cost).

We'll repeat the preceding cell here to see.

In [33]:
%%time
# Verify that run_pipeline_jitted works as intended
result, output = run_pipeline_jitted()
display(result.head())
print(result.shape)
print(output)

| | Violation Precinct | Summons Number | total_summons_dollars |
|---|---|---|---|
| 0 | 34 | 150532 | 12667445 |
| 1 | 43 | 165959 | 11503985 |
| 2 | 120 | 65635 | 4608645 |
| 3 | 163 | 34 | 3010 |
| 4 | 11 | 88 | 8640 |


(283, 3)


Reading Time:                              0.834 sec
Violation and precincts load Time:         0.016 sec
Eliminate undefined violations time:       0.045 sec
Remove outliers time:                      0.003 sec
Merge time:                                0.038 sec
Calculate Total Summons Time:              0.114 sec
Aggregate code time:                       0.043 sec
Execution time (run_pipeline_jitted):      1.095 sec

CPU times: user 5.71 ms, sys: 3.03 ms, total: 8.75 ms
Wall time: 14.6 s


Again, once `run_pipeline_jitted` has been compiled, the `Wall time` should be only negligibly more than the `Execution time`. For data this small, the above is adequate and, unless we're running `run_pipeline_jitted` thousands or millions of times, we're probably content at this point.
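
The per-stage timings printed earlier can also be compared against each pipeline's reported total. The short calculation below copies the figures from the first `run_pipeline` output and the first `run_pipeline_jitted` output in this notebook; `untimed_gap` is a hypothetical helper. The gap it reports is time spent inside the wrapper but outside every stage timer, which is consistent with (though not proof of) a boxing & unboxing cost between stages.

```python
# Per-stage timings (seconds) copied from the first run of each pipeline.
run_pipeline_stages = [1.492, 0.631, 0.111, 0.004, 0.379, 0.638, 0.328]
run_pipeline_total = 12.302

jitted_stages = [3.429, 0.019, 0.045, 0.005, 0.030, 0.125, 0.030]
jitted_total = 3.685

def untimed_gap(stages, total):
    """Time inside the wrapper that no individual stage timer accounts for."""
    return round(total - sum(stages), 3)

gap_plain = untimed_gap(run_pipeline_stages, run_pipeline_total)
gap_jitted = untimed_gap(jitted_stages, jitted_total)
print(f"run_pipeline gap:        {gap_plain:.3f} sec")
print(f"run_pipeline_jitted gap: {gap_jitted:.3f} sec")
```

With these figures, the gap is 8.719 seconds for `run_pipeline` and 0.002 seconds for `run_pipeline_jitted`.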

In the next notebook, we'll work with this data pipeline again, modifying it to see how it performs as the data size scales.

---------------------