[![AWS SDK for pandas](_static/logo.png "AWS SDK for pandas")](https://github.com/aws/aws-sdk-pandas)

# 34 - Distributing Calls Using Ray

AWS SDK for pandas can distribute specific calls using [Ray](https://docs.ray.io/) and [Modin](https://modin.readthedocs.io/en/stable/).

When distribution is enabled, data loading methods return Modin data frames instead of Pandas data frames. Modin is designed to be compatible with existing Pandas code, so that code generally keeps working while its operations are distributed across your Ray instance. This lets you operate on much larger datasets than a single Pandas process can handle.

In [1]:
%pip install "awswrangler[ray,modin]==3.0.0b3"

Importing `awswrangler` when `ray` and `modin` are installed will automatically initialize a local Ray instance.

In [1]:
import awswrangler as wr
print(f"Execution Engine: {wr.engine.get()}")
print(f"Memory Format: {wr.memory_format.get()}")

2022-10-11 11:36:39,413	INFO worker.py:1518 -- Started a local Ray instance.


Execution Engine: EngineEnum.RAY
Memory Format: MemoryFormatEnum.MODIN


#### Enter your bucket name:

In [3]:
bucket = "<BUCKET_NAME>"
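
Because `<BUCKET_NAME>` is only a placeholder, it can help to fail early if it was never replaced. The following is a minimal sketch of such a check; `build_output_path` and its `prefix` default are hypothetical names introduced for illustration and are not part of awswrangler.

```python
def build_output_path(bucket: str, prefix: str = "amazon-reviews") -> str:
    """Return an S3 directory path, rejecting an unreplaced placeholder.

    Hypothetical helper for illustration only; not an awswrangler function.
    """
    if not bucket or bucket.startswith("<") or bucket.endswith(">"):
        raise ValueError("Replace <BUCKET_NAME> with the name of a bucket you can write to.")
    # The trailing slash marks the path as a directory (prefix), not a single object.
    return f"s3://{bucket}/{prefix.strip('/')}/"


example_path = build_output_path("my-example-bucket")
print(example_path)  # s3://my-example-bucket/amazon-reviews/
```

The returned string has the same shape as the `path` argument used when writing later in this tutorial.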

#### Read data at scale on the cluster

In [4]:
frame = wr.s3.read_parquet(path="s3://amazon-reviews-pds/parquet/product_category=Furniture/")
frame.head(5)

Read progress: 100%|█████████████████████████████████████████████████████████████████████████████| 10/10 [00:38<00:00,  3.85s/it]


| | marketplace | customer_id | review_id | product_id | product_parent | product_title | star_rating | helpful_votes | total_votes | vine | verified_purchase | review_headline | review_body | review_date | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | US | 35680291 | R34O1VWWYVAU9A | B000MWFEV6 | 406798096 | Baxton Studio Full Leather Storage Bench Ottom... | 5 | 1 | 1 | N | Y | High quality and roomy | I bought this bench as a storage necessity as ... | 2009-05-17 | 2009 |
| 1 | US | 21000590 | RU1I9NHALXPW5 | B004C1RULU | 239421036 | Alera Fraze Series Leather High-Back Swivel/Ti... | 3 | 8 | 9 | N | Y | Do not judge the chair on the first day alone. | Received this chair really fast because I had ... | 2012-06-29 | 2012 |
| 2 | US | 12140069 | R2O8R9CLCUQTB8 | B000GFWQDI | 297104356 | Matching Cherry Printer Stand with Casters and... | 5 | 4 | 4 | N | Y | Printer stand made into printer / PC stand | I wanted to get my pc's off the floor and off ... | 2009-05-17 | 2009 |
| 3 | US | 23755701 | R12FOIKUUXPHBZ | B0055DOI50 | 39731200 | Marquette Bed | 5 | 6 | 6 | N | Y | Excellent Value!! | Great quality for the price. This bed is easy ... | 2012-06-29 | 2012 |
| 4 | US | 50735969 | RK0XUO7P40TK9 | B0026RH3X2 | 751769063 | Cape Craftsman Shutter 2-Door Cabinet | 3 | 12 | 12 | N | N | Nice, but not best quality | I love the design of this cabinet! It's a very... | 2009-05-17 | 2009 |


The returned object is a Modin DataFrame:

In [5]:
type(frame)

modin.pandas.dataframe.DataFrame

However, this type is interoperable with standard Pandas calls:

In [6]:
filtered_frame = frame[frame.helpful_votes > 10]
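
To illustrate that this filter is ordinary boolean-mask indexing, here is a minimal sketch on a small plain Pandas frame. The sample values are copied from the five rows previewed earlier; `filter_helpful` and `min_votes` are names made up for this illustration. Assuming Modin's Pandas-compatible API, the same function should also accept the Modin `frame` returned by `wr.s3.read_parquet`.

```python
import pandas as pd

# Small stand-in for the reviews data, using values from the preview above.
sample = pd.DataFrame(
    {
        "product_id": ["B000MWFEV6", "B004C1RULU", "B000GFWQDI", "B0055DOI50", "B0026RH3X2"],
        "star_rating": [5, 3, 5, 5, 3],
        "helpful_votes": [1, 8, 4, 6, 12],
    }
)


def filter_helpful(df, min_votes=10):
    """Keep rows with more than `min_votes` helpful votes."""
    return df[df["helpful_votes"] > min_votes]


filtered_sample = filter_helpful(sample)
print(filtered_sample)
```

Only the row with 12 helpful votes passes the threshold of 10, so the result contains a single product.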

When writing data to S3, it's recommended to provide a directory path and set `dataset=True`, as in the call below. This ensures that each worker writes its own portion of the data, without the data being shuffled across workers.

In [9]:
result = wr.s3.to_parquet(
    filtered_frame,
    path=f"s3://{bucket}/amazon-reviews/",
    dataset=True,
    dtype={"review_date": "timestamp"},
)
print(f"Data has been written to {len(result['paths'])} files")

Write Progress: 100%|████████████████████████████████████████████████████████████████████████████| 10/10 [00:07<00:00,  1.34it/s]

Data has been written to 10 files



