In [1]:
# will not copy to the main file
import os, sys, inspect
import pandas as pd
from memory_profiler import memory_usage

In [2]:
# will not copy to the main file
%load_ext rpy2.ipython
%load_ext memory_profiler



In [3]:
import pyarrow.feather as feather

In [4]:
# will not copy to the main file
currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(currentdir) # this refers to the project root folder

In [5]:
processed_folder = os.path.join(parentdir, "data", "processed")

input_path = os.path.join(processed_folder, "combined_data.csv")
output_path = os.path.join(processed_folder, "combined_data.feather")

In [6]:
%%time
# %memit only measures a statement placed on the same line
%memit df = pd.read_csv(input_path)

In [7]:
%%time
# %memit only measures a statement placed on the same line
%memit feather.write_feather(df, output_path)

In [8]:
# %%R

# install.packages("dplyr")
# library(arrow)

In [9]:
%%time
%%R
library(dplyr)
library(arrow)
start_time <- Sys.time()
r_table <- arrow::read_feather("../data/processed/combined_data.feather")
print(class(r_table))
result <- r_table %>% count(model)
end_time <- Sys.time() 
print(result)
print(end_time - start_time)

R[write to console]: 
Attaching package: 'dplyr'


R[write to console]: The following objects are masked from 'package:stats':

    filter, lag


R[write to console]: The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union


R[write to console]: 
Attaching package: 'arrow'


R[write to console]: The following object is masked from 'package:utils':

    timestamp




[1] "tbl_df"     "tbl"        "data.frame"
# A tibble: 27 x 2
   model                  n
   <chr>              <int>
 1 ACCESS-CM2       1932840
 2 ACCESS-ESM1-5    1610700
 3 AWI-ESM-1-1-LR    966420
 4 BCC-CSM2-MR      3035340
 5 BCC-ESM1          551880
 6 CanESM5           551880
 7 CMCC-CM2-HR4     3541230
 8 CMCC-CM2-SR5     3541230
 9 CMCC-ESM2        3541230
10 EC-Earth3-Veg-LR 3037320
# ... with 17 more rows
Time difference of 19.49337 secs
Wall time: 20.5 s
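
The per-model counts printed by R can be cross-checked from the Python side with an equivalent pandas aggregation. The following is a minimal sketch on a tiny made-up frame, not the real dataset: only the `model` column name comes from the notebook, while the `value` column, its numbers, and the helper name `count_by_model` are assumptions for illustration.

```python
import pandas as pd


def count_by_model(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per model with its row count, similar to dplyr's count(model)."""
    # groupby(...).size() counts rows per group; reset_index turns it into a frame
    return frame.groupby("model").size().reset_index(name="n")


# Toy stand-in for the real data (hypothetical values, for illustration only)
toy = pd.DataFrame(
    {
        "model": ["ACCESS-CM2", "CanESM5", "ACCESS-CM2", "BCC-ESM1", "CanESM5"],
        "value": [0.1, 0.0, 2.3, 1.7, 0.4],
    }
)

counts = count_by_model(toy)
print(counts)
```

On the real data this would be called as `count_by_model(df)`. Row order may differ from the R output, because pandas sorts group keys by code point while the R output above appears to use locale-aware ordering (`CanESM5` before `CMCC-CM2-HR4`).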


## Conclusion

After thorough discussion, our team agreed that a `Feather` file is the most suitable way to transfer the dataframe from Python to R. Our decision criteria were read/write speed and support for the operations we need on the R side.

Because the input data is around 5.7 GB, `Pandas` exchange is extremely slow, so we rejected it. We did not select `Arrow` exchange because it supports only a subset of operations (https://arrow.apache.org/docs/r/articles/dataset.html, referenced in the lecture notes). A `Parquet` file is fairly fast, but it is still slower to read than a `Feather` V2 file, and writing a Python dataframe to a `Feather` file is likewise much faster than writing it to a `Parquet` file (https://ursalabs.org/blog/2020-feather-v2/, also referenced at the end of the lecture notes). Although a `Parquet` file takes less storage space, that is not a major concern for us because storage is cheap at this file size.