# Experimentation and troubleshooting for Dan

Anything important learned here should be documented and folded into the tutorial.

In [60]:
!pip install pyarrow==7.0 --user

Collecting pyarrow==7.0
  Using cached pyarrow-7.0.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.7 MB)
Installing collected packages: pyarrow
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
xgboost-ray 0.1.4 requires pyarrow<5.0.0, but you have pyarrow 7.0.0 which is incompatible.
dominodatalab-data 0.2.1 requires backoff<2.0.0,>=1.11.1, but you have backoff 1.10.0 which is incompatible.
dominodatalab-data 0.2.1 requires pyarrow<7.0.0,>=6.0.0, but you have pyarrow 7.0.0 which is incompatible.
Successfully installed pyarrow-7.0.0


In [None]:
#ray.shutdown()

In [1]:
import ray
import os

# os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION']='python'

# Connect to the cluster's Ray head node unless a connection already exists
if not ray.is_initialized():
    service_host = os.environ["RAY_HEAD_SERVICE_HOST"]
    service_port = os.environ["RAY_HEAD_SERVICE_PORT"]
    ray.init(f"ray://{service_host}:{service_port}")

In [2]:
import numpy as np
import pandas as pd
import modin.pandas as mpd
import pyarrow.parquet as pq
import pyarrow.dataset as pds
import pyarrow as pa

In [13]:
# Generate some dummy data in multiple files
default_dataset_path = f"/domino/datasets/local/{os.environ['DOMINO_PROJECT_NAME']}"

def generate_dummy_data_filesplit(n_rows, n_columns, name, n_parts=10):
    """Write random float64 columns as a Parquet dataset split into roughly n_parts files."""
    dummy_file_root = os.path.join(default_dataset_path, name)
    data_columns = {f"col_{i}": np.random.standard_normal(n_rows) for i in range(n_columns)}
    n_per = n_rows // n_parts  # rows per file (and per row group)
    pds.write_dataset(
        pa.Table.from_pydict(data_columns),
        dummy_file_root,
        format='parquet',
        max_rows_per_file=n_per,
        max_rows_per_group=n_per
    )
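
As a sanity check on the split above, the number of output files can be estimated from `n_rows` and `n_parts`. This is a minimal sketch, assuming the writer starts a new file each time `max_rows_per_file` is reached; `expected_file_count` is a hypothetical helper, not part of pyarrow.

```python
def expected_file_count(n_rows: int, n_parts: int = 10) -> int:
    """Estimate the number of files written (assumed behavior, not a pyarrow API)."""
    n_per = n_rows // n_parts  # same floor division as generate_dummy_data_filesplit
    if n_per == 0:
        # This sketch does not model a row limit of 0
        raise ValueError("n_rows must be at least n_parts")
    return -(-n_rows // n_per)  # integer ceiling of n_rows / n_per


print(expected_file_count(10**7, 10))  # 10: evenly divisible
print(expected_file_count(105, 10))    # 11: the remainder spills into an extra file
```

If the estimate holds, a row count that is not a multiple of `n_parts` yields one more file than requested, which is worth knowing when matching `parallelism` to the file count.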

In [16]:
generate_dummy_data_filesplit(10**7, 20, "medium-filesplit")

[2m[33m(raylet, ip=10.0.66.231)[0m [2022-08-08 17:08:29,959 C 19 19] gcs_client.cc:328: Couldn't reconnect to GCS server. The last attempted GCS server address was :0
[2m[33m(raylet, ip=10.0.66.231)[0m *** StackTrace Information ***
[2m[33m(raylet, ip=10.0.66.231)[0m     ray::SpdLogMessage::Flush()
[2m[33m(raylet, ip=10.0.66.231)[0m     ray::RayLog::~RayLog()
[2m[33m(raylet, ip=10.0.66.231)[0m     ray::gcs::GcsClient::ReconnectGcsServer()
[2m[33m(raylet, ip=10.0.66.231)[0m     std::function<>::operator()()
[2m[33m(raylet, ip=10.0.66.231)[0m     std::_Function_handler<>::_M_invoke()
[2m[33m(raylet, ip=10.0.66.231)[0m     ray::rpc::ClientCallImpl<>::OnReplyReceived()
[2m[33m(raylet, ip=10.0.66.231)[0m     std::_Function_handler<>::_M_invoke()
[2m[33m(raylet, ip=10.0.66.231)[0m     boost::asio::detail::completion_handler<>::do_complete()
[2m[33m(raylet, ip=10.0.66.231)[0m     boost::asio::detail::scheduler::do_run_one()
[2m[33m(raylet, ip=10.0.66.231)[0

In [24]:
!ls /domino/datasets/local/Ray-Tutorial/smallest-split

13355247611c4ffd9ffff1e218e9bd9e.parquet


In [52]:
def generate_dummy_data_split(n_rows, n_columns, name):
    """Write one Parquet file with about 10 row groups and report its size on disk."""
    dummy_file_path = os.path.join(default_dataset_path, f"{name}.parquet")
    arr = np.random.standard_normal((n_rows, n_columns))
    df = pd.DataFrame(arr, columns=[str(i) for i in range(n_columns)])
    df.to_parquet(dummy_file_path, row_group_size=n_rows // 10)
    size_gib = os.path.getsize(dummy_file_path) / 1024**3
    print(f"With {n_rows} rows and {n_columns} columns the new file is {size_gib} GiB on disk")
    return dummy_file_path

In [53]:
generate_dummy_data_split(10**7, 20, "medium-split")

With 10000000 rows and 20 columns the new file is 1.5422909455373883 GiB on disk


'/domino/datasets/local/Ray-Tutorial/medium-split.parquet'
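
The reported size is close to the raw footprint of the values, which is plausible because random normal data compresses poorly. A rough estimate, assuming every value is an 8-byte float64 and ignoring Parquet metadata, encoding, and compression (`raw_float64_gib` is a hypothetical helper):

```python
def raw_float64_gib(n_rows: int, n_columns: int) -> float:
    """Uncompressed size in GiB, assuming 8 bytes per value."""
    bytes_per_value = 8  # float64, the dtype np.random.standard_normal returns
    return n_rows * n_columns * bytes_per_value / 1024**3


estimate = raw_float64_gib(10**7, 20)
print(f"{estimate:.2f} GiB")  # 1.49 GiB
```

The on-disk figure of about 1.54 GiB sits slightly above this estimate, so the file is not shrinking the data in any meaningful way.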

#### Try with Ray Data

In [3]:
medium_file = '/domino/datasets/local/Ray-Tutorial/medium-filesplit'

In [4]:
ds = ray.data.read_parquet(medium_file, parallelism = 10)

In [5]:
ds.show(3)

{'col_0': 1.158893411011255, 'col_1': 0.6024888481795232, 'col_2': 0.8547269707463037, 'col_3': -0.7717945711432682, 'col_4': 1.327244959107909, 'col_5': 1.6760976533642786, 'col_6': 0.851013834218383, 'col_7': -0.6105415319538923, 'col_8': -0.8987464297029039, 'col_9': 0.1937053286234583, 'col_10': -0.4011255531356019, 'col_11': -1.142756618209515, 'col_12': -2.0355433205668514, 'col_13': -1.6062197382186973, 'col_14': -1.7626335335372112, 'col_15': -0.5957913560997585, 'col_16': 1.2440529127597844, 'col_17': 0.08486158782383446, 'col_18': -0.9278595131767658, 'col_19': 0.8911579179072887}
{'col_0': -1.1585311566375038, 'col_1': 0.155681761397261, 'col_2': -0.10374564341978187, 'col_3': 0.7309110889294081, 'col_4': -0.8441261132098692, 'col_5': -0.4426172541321336, 'col_6': 2.145427727628284, 'col_7': 0.7734108353639534, 'col_8': 0.15244208585701527, 'col_9': 0.5341579368697362, 'col_10': 0.21354196252623878, 'col_11': 0.47904246676231227, 'col_12': 0.03953915050317926, 'col_13': -0.1

In [6]:
def dummy_transform_batch(t: pa.Table) -> pd.DataFrame:
    # Row-wise sum across all columns; returns a single-column DataFrame named 0
    return t.to_pandas().sum(axis=1).to_frame()

In [7]:
ds2 = ds.map_batches(dummy_transform_batch)

Map Progress: 100%|██████████| 10/10 [00:04<00:00,  2.49it/s]


In [8]:
ds2.show(3)

{'0': -1.868768742002446}
{'0': -0.6332290800922385}
{'0': -0.7050184444842438}


### Final Notes
Victory! The split dataset reads and transforms cleanly with Ray Data.

This needs cleaning up and summarizing for Dan, Ben, and others. Points to remember:
* pyarrow 7.0 was needed for the convenient `max_rows_per_file` option, despite the dependency conflicts pip reported with xgboost-ray and dominodatalab-data.
* `ray.data.read_parquet` needs `parallelism` set explicitly. It failed without it, even on the split files (re-test to confirm). It also failed when `parallelism` was set on the monolithic file, which matches the key line in the docs: "Parallelism may be limited by the number of files of the dataset." (From https://docs.ray.io/en/latest/data/package-ref.html)
* Look more closely at the Ray Web UI: with parallelism enabled, Ray Data appears to be somewhat lazy. `ds.show` did not seem to trigger a full read; the dummy sum operation was needed for that. Should the modin example do something similar, to check whether it handles the monolithic file as well? (This may be drifting into intermediate-tutorial territory.)
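
The docs quote about parallelism can be turned into a quick pre-flight check. This is only a sketch of one reading of that sentence, assuming the usable parallelism is capped at the number of Parquet files; Ray's actual scheduling may differ. Both helpers are hypothetical and use only the standard library.

```python
import os
import tempfile


def count_parquet_files(path: str) -> int:
    """Count .parquet files at path (a single file or a flat directory)."""
    if os.path.isfile(path):
        return 1 if path.endswith(".parquet") else 0
    return sum(1 for name in os.listdir(path) if name.endswith(".parquet"))


def effective_parallelism(requested: int, path: str) -> int:
    """Assumed upper bound on read tasks: capped by the number of files."""
    return min(requested, count_parquet_files(path))


# Demo on a throwaway directory holding empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for i in range(10):
        open(os.path.join(tmp, f"part-{i}.parquet"), "w").close()
    split_result = effective_parallelism(10, tmp)
    monolith_result = effective_parallelism(10, os.path.join(tmp, "part-0.parquet"))

print(split_result, monolith_result)  # 10 1
```

Running a check like this before calling `ray.data.read_parquet` would flag the monolithic-file case, where a requested parallelism of 10 cannot be met by a single file.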