This example is based on:
https://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes


# Using bodo to read and analyse NYC taxi data from 2015

This notebook guides you through the first steps of using **bodo**. It also includes some comparisons and notes on using **bodo** versus pure Python (+ numpy + pandas).

## Setup

First, you will probably need to download the NYC Yellow Taxi data from 2015.
Uncomment the following code (including the `wget` line) and change **DATA_DIR** to your local directory so that you can run the cell.

In [14]:
"""
%%bash

URL_ROOT=https://s3.amazonaws.com/nyc-tlc/trip+data
DATA_DIR=/work/bodoai/dataset/nyc-trip-2015

for i in {1..12}; do
    month=$(printf "%02d" "$i");
    FILENAME=yellow_tripdata_2015-${month}.csv;
    # wget -c ${URL_ROOT}/${FILENAME} -O ${DATA_DIR}/${FILENAME};
done
""";

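For reference, the file names that the loop above iterates over can also be built in Python. This is a minimal sketch that only constructs the names and URLs from the `URL_ROOT` shown above and does not download anything; the helper name `monthly_urls` is hypothetical, and whether the bucket still serves these files is not checked here.

```python
URL_ROOT = "https://s3.amazonaws.com/nyc-tlc/trip+data"


def monthly_urls(year=2015, months=range(1, 13)):
    """Return (filename, url) pairs for the monthly yellow-taxi CSV files."""
    pairs = []
    for month in months:
        # Zero-pad the month so that 1 becomes "01", matching the file names.
        filename = f"yellow_tripdata_{year}-{month:02d}.csv"
        pairs.append((filename, f"{URL_ROOT}/{filename}"))
    return pairs


# Only the first three months are used later in this notebook.
for filename, url in monthly_urls(months=range(1, 4)):
    print(filename, url)
```
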
Because **bodo**'s **read_csv** works more easily with constant paths than with variable paths, you can link this folder inside the current path (optional; uncomment the code below if you want to run it).

In [16]:
"""
%%bash

mkdir -p ./data
ln -s /work/bodoai/dataset/nyc-trip-2015 ./data
""";

In [17]:
!ls ./data/nyc-trip-2015

yellow_tripdata_2015-01.csv  yellow_tripdata_2015-03.csv
yellow_tripdata_2015-02.csv


Because it is easier to work with a `parquet` file than with `csv` files, we will convert the `csv` files to a single `parquet` file (for now, let's use just the first 3 files).

In [22]:
import pandas as pd

In [21]:
dfs = []
data_dir = './data/nyc-trip-2015'

for i in range(1, 4):
    df = pd.read_csv(
        f'{data_dir}/yellow_tripdata_2015-{i:02d}.csv',
        parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
    )
    dfs.append(df)
    del df

df = pd.concat(dfs)
df.to_parquet('./data/nyc-trip-2015/q1.pq')
del df
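
Note that `pd.concat` keeps each input frame's own index by default, which is why the combined frame reports an index running from 0 to 13351608 even though it holds 38551116 rows. The sketch below illustrates this on tiny made-up frames (the values are placeholders, not taxi data) and shows `ignore_index=True` as one way to get a unique index; the notebook itself does not use it.

```python
import pandas as pd

# Two tiny stand-in frames; the values are invented for illustration only.
jan = pd.DataFrame({"trip_distance": [1.0, 2.5]})
feb = pd.DataFrame({"trip_distance": [0.7, 3.1, 4.2]})

# Default behaviour: the original indexes are kept, so labels repeat.
kept = pd.concat([jan, feb])
print(list(kept.index))

# ignore_index=True renumbers the rows from 0 to n - 1.
renumbered = pd.concat([jan, feb], ignore_index=True)
print(list(renumbered.index))
```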

In [25]:
!ls ./data/nyc-trip-2015

q1.pq			     yellow_tripdata_2015-02.csv
yellow_tripdata_2015-01.csv  yellow_tripdata_2015-03.csv


Now we are almost ready to start; let's just import `bodo` (and `time` for profiling).

In [23]:
import time
import bodo

### Bodo px magic command

You can also use the `%px` magic command to run your cells with
[MPI](https://ipyparallel.readthedocs.io/en/latest/magics.html).
To do so, use ipyparallel to connect to the MPI controller you created
(see the [bodo documentation](https://docs.bodo.ai/latest/source/user_guide.html)
for more information).

In [None]:
import ipyparallel as ipp
c = ipp.Client(profile="mpi")
view = c[:]
view.activate()
view.block = True
import os
view["cwd"] = os.getcwd()
%px cd $cwd

## Load data from the Parquet file

For this experiment we use only the first 3 months of 2015,
which take about 6 GB on disk and about 30 GB of RAM while loading (this is the memory needed to load the data, not the size of the dataframe in memory).

In [29]:
# @bodo.jit
def read_data():
    return pd.read_parquet('./data/nyc-trip-2015/q1.pq')


t0 = time.time()
df = read_data()
t1 = time.time()

print('read_data execution time:', t1 - t0, 's')

read_data execution time: 11.349582195281982 s


Let's check some information about the dataframe:

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38551116 entries, 0 to 13351608
Data columns (total 19 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   VendorID               int64         
 1   tpep_pickup_datetime   datetime64[ns]
 2   tpep_dropoff_datetime  datetime64[ns]
 3   passenger_count        int64         
 4   trip_distance          float64       
 5   pickup_longitude       float64       
 6   pickup_latitude        float64       
 7   RateCodeID             int64         
 8   store_and_fwd_flag     object        
 9   dropoff_longitude      float64       
 10  dropoff_latitude       float64       
 11  payment_type           int64         
 12  fare_amount            float64       
 13  extra                  float64       
 14  mta_tax                float64       
 15  tip_amount             float64       
 16  tolls_amount           float64       
 17  improvement_surcharge  float64       
 18  total_amount        

Now, using bodo:

In [31]:
read_data = bodo.jit(read_data)

t0 = time.time()
df = read_data()
t1 = time.time()

print('read_data execution time:', t1 - t0, 's')



read_data execution time: 16.90570092201233 s


In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38551116 entries, 0 to 13351608
Data columns (total 19 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   VendorID               int64         
 1   tpep_pickup_datetime   datetime64[ns]
 2   tpep_dropoff_datetime  datetime64[ns]
 3   passenger_count        int64         
 4   trip_distance          float64       
 5   pickup_longitude       float64       
 6   pickup_latitude        float64       
 7   RateCodeID             int64         
 8   store_and_fwd_flag     object        
 9   dropoff_longitude      float64       
 10  dropoff_latitude       float64       
 11  payment_type           int64         
 12  fare_amount            float64       
 13  extra                  float64       
 14  mta_tax                float64       
 15  tip_amount             float64       
 16  tolls_amount           float64       
 17  improvement_surcharge  float64       
 18  total_amount        

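**NOTE:** As you can see, the bodo version took longer than the pure Python version (about 16.9 s versus 11.3 s). The first call to a jitted function also includes compilation time, so a single measurement like this one overstates the cost of later calls. Bodo will probably show more benefit with more data.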
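
To see how much of a measurement is one-off overhead such as JIT compilation, the first call can be timed separately from later calls. The helper below is a minimal sketch using only the standard library; `time_calls` and `slow_first_call` are hypothetical names, and `slow_first_call` merely simulates a one-off start-up cost so that the example runs without bodo.

```python
import time


def time_calls(fn, *args, repeats=3):
    """Call fn(*args) `repeats` times and return the elapsed seconds per call."""
    timings = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - t0)
    return timings


_warmed_up = False


def slow_first_call():
    """Stand-in for a jitted function: only the first call pays a start-up cost."""
    global _warmed_up
    if not _warmed_up:
        time.sleep(0.05)  # simulated one-off cost (e.g. compilation)
        _warmed_up = True
    return sum(range(1000))


timings = time_calls(slow_first_call)
print("first call:", timings[0], "s")
print("later calls:", timings[1:], "s")
```

With the functions in this notebook, the same helper could be applied as `time_calls(read_data)` after wrapping `read_data` with `bodo.jit`, keeping in mind that every call reloads the whole file.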

## Basic Aggregations and Groupbys

Let's try some basic aggregation on our dataframe:

In [37]:
def mean_by_each_passanger_count(df):
    return df.groupby(df.passenger_count).trip_distance.mean()


t0 = time.time()
se = mean_by_each_passanger_count(df)
t1 = time.time()

print(se)

print('\nmean_by_each_passanger_count execution time:', t1 - t0, 's')

passenger_count
0     2.245734
1    16.757501
2    12.369026
3   -18.456581
4    14.271708
5     2.912368
6     2.838362
7     4.658491
8     2.426512
9     5.477000
Name: trip_distance, dtype: float64

mean_by_each_passanger_count execution time: 0.4866049289703369 s


In [38]:
mean_by_each_passanger_count = bodo.jit(mean_by_each_passanger_count)


t0 = time.time()
se = mean_by_each_passanger_count(df)
t1 = time.time()

print(se)

print('\nmean_by_each_passanger_count execution time:', t1 - t0, 's')


IndexError: Failed in bodo mode pipeline (step: <class 'bodo.compiler.BodoSeriesPass'>)
list index out of range
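
The traceback does not say which construct is unsupported. One thing that may be worth trying, as an untested assumption rather than a confirmed fix, is to pass the grouping key as a column name instead of the Series `df.passenger_count`. The sketch below only shows, in plain pandas and on made-up values, that the two spellings give the same result; the names `mean_by_series_key` and `mean_by_passenger_count_by_name` are hypothetical, and whether bodo compiles the second form is not verified here.

```python
import pandas as pd

# Tiny invented sample with the two columns used above.
sample = pd.DataFrame(
    {
        "passenger_count": [1, 1, 2, 2, 2],
        "trip_distance": [1.0, 3.0, 2.0, 4.0, 6.0],
    }
)


def mean_by_series_key(df):
    # Spelling used in the notebook: group by a Series object.
    return df.groupby(df.passenger_count).trip_distance.mean()


def mean_by_passenger_count_by_name(df):
    # Alternative spelling: group by the column name.
    return df.groupby("passenger_count")["trip_distance"].mean()


by_series = mean_by_series_key(sample)
by_name = mean_by_passenger_count_by_name(sample)
print(by_name)
```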