## Pre-requisites
- Set up a conda environment whose Python version matches that of the Ray cluster: `conda create -n rayvenv python=3.9.18`
- Activate the conda environment using `conda activate rayvenv`
- Install Jupyter using `pip install jupyter`
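
The version requirement in the first bullet can be checked from inside the notebook. The following is a minimal sketch, assuming the cluster runs Python 3.9.18 as in the conda command above; `CLUSTER_PYTHON` and `version_mismatch` are hypothetical names introduced for illustration, and how strictly Ray enforces the match is not covered here.

```python
import sys

# Assumption: the Ray cluster runs Python 3.9.18, matching the conda command above.
CLUSTER_PYTHON = (3, 9, 18)

def version_mismatch(local, cluster=CLUSTER_PYTHON):
    """Return the first differing component ('major', 'minor', 'patch'), or None."""
    for name, have, want in zip(("major", "minor", "patch"), local, cluster):
        if have != want:
            return name
    return None

local_version = tuple(sys.version_info[:3])
mismatch = version_mismatch(local_version)
if mismatch is None:
    print(f"Local Python {local_version} matches the assumed cluster version.")
else:
    print(f"Local Python {local_version} differs from {CLUSTER_PYTHON} in the {mismatch} version.")
```

If the check reports a difference, recreating the conda environment with the cluster's version is likely the simplest fix.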

In [3]:
!pip install "ray[client]" -q


## Simple Ray Test

In [2]:
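# Forward the Ray client port (10001) of the head service to localhost.
# Uncomment if the port-forward is not already running in a terminal.
# !kubectl -n raycluster port-forward svc/raycluster-kuberay-head-svc 10001 &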

In [3]:
import ray
import os

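# Point Ray at the port-forwarded head service through the Ray Client endpoint
os.environ['RAY_ADDRESS'] = 'ray://localhost:10001'

# Optional: connect explicitly. If left commented out, Ray should connect
# on the first .remote() call using RAY_ADDRESS.
# ray.init()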

@ray.remote
def square(num):
    """A remote function to compute the square of a number."""
    return num * num

# Create a list to hold references to the asynchronous tasks
futures = []

# Distribute the computation of squares across Ray workers
for i in range(100):
    futures.append(square.remote(i))

# Retrieve and print the results
results = ray.get(futures)
print(results)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361, 400, 441, 484, 529, 576, 625, 676, 729, 784, 841, 900, 961, 1024, 1089, 1156, 1225, 1296, 1369, 1444, 1521, 1600, 1681, 1764, 1849, 1936, 2025, 2116, 2209, 2304, 2401, 2500, 2601, 2704, 2809, 2916, 3025, 3136, 3249, 3364, 3481, 3600, 3721, 3844, 3969, 4096, 4225, 4356, 4489, 4624, 4761, 4900, 5041, 5184, 5329, 5476, 5625, 5776, 5929, 6084, 6241, 6400, 6561, 6724, 6889, 7056, 7225, 7396, 7569, 7744, 7921, 8100, 8281, 8464, 8649, 8836, 9025, 9216, 9409, 9604, 9801]


## Download dataset and push to S3

### Pre-requisites for pushing to S3 (MinIO)
- If you're using the GOKU architecture, set up MinIO as described in the setup docs.
- In MinIO, create a bucket, then a policy, then a user based on the policy, then an access key for that user.
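
For the policy step, MinIO accepts IAM-style JSON policy documents. The following is a minimal sketch of a policy scoped to a single bucket, assuming the bucket is named `unstructured-data` as in the upload code later in this notebook; `build_bucket_policy` is a hypothetical helper, and the action list is an assumption that may need adjusting (for example, adding `s3:CreateBucket` if the user should be allowed to create the bucket).

```python
import json

def build_bucket_policy(bucket_name):
    """Build an IAM-style policy allowing list, read, and write on one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level actions apply to the bucket ARN itself
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": [f"arn:aws:s3:::{bucket_name}"],
            },
            {
                # Object-level actions apply to keys inside the bucket
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            },
        ],
    }

# Print the JSON so it can be pasted into the MinIO policy editor
print(json.dumps(build_bucket_policy("unstructured-data"), indent=2))
```

Keeping bucket-level and object-level actions in separate statements makes it easier to see which resource each permission applies to.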

In [5]:
!bash download_dataset.sh # get from https://github.com/aishwaryaprabhat/Advanced-RAG/blob/main/download_dataset.sh

Cloning into 'DataRepository'...
remote: Enumerating objects: 54, done.
remote: Counting objects: 100% (46/46), done.
remote: Compressing objects: 100% (34/34), done.
remote: Total 54 (delta 15), reused 20 (delta 7), pack-reused 8
Receiving objects: 100% (54/54), 51.28 MiB | 18.17 MiB/s, done.
Resolving deltas: 100% (15/15), done.
mkdir: cannot create directory ‘source_docs’: File exists
Archive:  DataRepository/high-performance-rag/Camel Papers Test.zip
  inflating: source_docs/Acute respiratory distress syndrome in an alpaca cria.pdf  
  inflating: source_docs/Alpaca liveweight variations and fiber production in Mediterranean range of Chile.pdf  
Archive:  DataRepository/high-performance-rag/Camel Papers Train.zip
  inflating: source_docs/Antibody response to the epsilon toxin ofClostridium perfringensfollowing vaccination of Lama glamacrias.pdf  
  inflating: source_docs/Comparative pigmentation of sheep, goats, and llamas what colors are possible through selection.pdf  

In [7]:
!pip install boto3 -q


In [10]:
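# Forward the MinIO API port (9000) to localhost.
# Uncomment if the port-forward is not already running in a terminal.
# !kubectl -n minio port-forward svc/minio 9000 &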

In [9]:
import boto3
import os

def upload_directory_to_minio(bucket_name, directory_path, endpoint_url, access_key, secret_key):
    # Create a boto3 session
    session = boto3.session.Session()

    # Create an S3 client configured for MinIO
    s3_client = session.client(
        service_name='s3',
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        endpoint_url=endpoint_url,
        region_name='us-east-1',  # This can be any string
        config=boto3.session.Config(signature_version='s3v4')
    )

    # Ensure bucket exists (create if not)
    # Note: MinIO may require manual bucket creation or different permissions setup
    try:
        # head_bucket raises ClientError when the bucket is missing or inaccessible
        s3_client.head_bucket(Bucket=bucket_name)
        print(f"Bucket {bucket_name} already exists.")
    except s3_client.exceptions.ClientError:
        s3_client.create_bucket(Bucket=bucket_name)
        print(f"Bucket {bucket_name} created.")

    # Upload each PDF in the directory
    for root, dirs, files in os.walk(directory_path):
        for file in files:
            if file.lower().endswith('.pdf'):
                file_path = os.path.join(root, file)
                object_name = os.path.relpath(file_path, directory_path).replace("\\", "/")  # Ensure proper path format
                try:
                    s3_client.upload_file(file_path, bucket_name, object_name)
                    print(f"Uploaded {file_path} as {object_name}")
                except Exception as e:
                    print(f"Failed to upload {file_path}: {e}")

# Usage example
endpoint_url = 'http://localhost:9000'  # MinIO endpoint, e.g. 'http://127.0.0.1:9000'
access_key = ''  # Access key created for the MinIO user
secret_key = ''  # Secret key created for the MinIO user
bucket_name = 'unstructured-data'
directory_path = 'source_docs'

upload_directory_to_minio(bucket_name, directory_path, endpoint_url, access_key, secret_key)


Bucket unstructured-data already exists.
Uploaded source_docs/The physiological impact of wool-harvesting procedures in vicunas (Vicugna vicugna)..pdf as The physiological impact of wool-harvesting procedures in vicunas (Vicugna vicugna)..pdf
Uploaded source_docs/Respiratory mechanics and results of cytologic examination of bronchoalveolar lavage fluid in healthy adult alpacas.pdf as Respiratory mechanics and results of cytologic examination of bronchoalveolar lavage fluid in healthy adult alpacas.pdf
Uploaded source_docs/Neurological Causes of Diaphragmatic Paralysis in 11 Alpacas.pdf as Neurological Causes of Diaphragmatic Paralysis in 11 Alpacas.pdf
Uploaded source_docs/Serum and urine analyte comparison between llamas and alpacas fed three forages.pdf as Serum and urine analyte comparison between llamas and alpacas fed three forages.pdf
Uploaded source_docs/Influence of effects on quality traits and relationships between traits of the llama fleece..pdf as Influence of effects on qu
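
The upload function derives each object name from the file's path relative to `directory_path`, so files in nested folders keep their folder prefix in the bucket. The following is a minimal dry-run sketch of that mapping, using the same rule as the cell above but without contacting MinIO; `plan_pdf_uploads` is a hypothetical helper, and the temporary directory and file names are made up for illustration.

```python
import os
import tempfile

def plan_pdf_uploads(directory_path):
    """Return sorted (file_path, object_name) pairs for every PDF under directory_path."""
    plan = []
    for root, _dirs, files in os.walk(directory_path):
        for name in files:
            if name.lower().endswith(".pdf"):
                file_path = os.path.join(root, name)
                # Same rule as the upload cell: relative path with forward slashes
                object_name = os.path.relpath(file_path, directory_path).replace("\\", "/")
                plan.append((file_path, object_name))
    return sorted(plan)

# Demo on a throwaway directory with illustrative file names
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "nested"))
    for rel in ("a.pdf", os.path.join("nested", "b.PDF"), "notes.txt"):
        with open(os.path.join(tmp, rel), "w") as handle:
            handle.write("placeholder")
    for _path, key in plan_pdf_uploads(tmp):
        print(key)
```

Running a plan like this before the real upload is one way to confirm which files will be sent and under which keys.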

In [11]:
!rm -rf source_docs

![](assets/minio_data.png)