# Object Storage and Data Pipeline Setup

This notebook demonstrates how to set up and manage data pipelines using rclone and object storage on Chameleon Cloud.

## Prerequisites

- GPU instance already created on Chameleon Cloud
- rclone is pre-installed with S3 credentials configured
- Access to Chameleon Cloud GUI for bucket creation

## Step 1: Initial Setup

### Create S3 Bucket
1. Navigate to the Chameleon Cloud GUI
2. Create a new bucket in your desired region
3. Name the bucket appropriately for your project

### Configure FUSE on GPU Instance
SSH into your GPU instance and uncomment the `user_allow_other` line in `/etc/fuse.conf`, which lets non-root users mount FUSE filesystems with the `allow_other` option:

```bash
sudo sed -i '/^#user_allow_other/s/^#//' /etc/fuse.conf
```
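
To confirm the edit took effect, a quick check along these lines can help. This is a minimal sketch: the helper name `fuse_allow_other_enabled` is an illustrative assumption, and it only looks for an uncommented `user_allow_other` line in the given file.

```bash
# Hypothetical helper: succeeds if the file contains an uncommented
# `user_allow_other` line. Defaults to /etc/fuse.conf.
fuse_allow_other_enabled() {
    grep -Eq '^[[:space:]]*user_allow_other' "${1:-/etc/fuse.conf}" 2>/dev/null
}

if fuse_allow_other_enabled; then
    echo "user_allow_other is enabled"
else
    echo "user_allow_other is still commented out or missing"
fi
```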

### Verify rclone Configuration
Check whether an S3 remote is already defined in the rclone configuration file:

```bash
nano ~/.config/rclone/rclone.conf
```

Look for an existing `[rclone_s3]` section in the file, then exit without saving.
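
The same check can be done without opening an editor. The sketch below assumes the config lives at the path opened above and that the remote appears as an INI-style `[rclone_s3]` section; the helper name `rclone_remote_configured` is hypothetical.

```bash
# Hypothetical helper: succeeds if the config file defines a [remote] section.
# $1 = remote name, $2 = config path (defaults to the path used above).
rclone_remote_configured() {
    grep -q "^\[$1]" "${2:-$HOME/.config/rclone/rclone.conf}" 2>/dev/null
}

if rclone_remote_configured rclone_s3; then
    echo "rclone_s3 remote found"
else
    echo "rclone_s3 remote not found; check the config file"
fi
```

Running `rclone listremotes` is another option: it prints each configured remote on its own line, followed by a colon.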

## Step 2: Data Pipeline Setup

### Overview
This pipeline loads training data using the existing `rclone_s3` remote defined in `rclone.conf`, transferring data from the object store to the GPU instance.

### ETL Pipeline Execution

The Docker Compose file (`docker-compose-etl-s3.yaml`) defines a service for each pipeline stage needed to load data from the object store to the GPU instance. Run the stages in the order shown.

#### 1. Extract Data
```bash
docker compose -f docker-compose-etl-s3.yaml run extract-data
```

#### 2. Transform Data
```bash
docker compose -f docker-compose-etl-s3.yaml run transform-data
```

#### 3. Load Data
First, set your container name environment variable:
```bash
export RCLONE_CONTAINER=object-persist-YOURNETID
```
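
A small guard can catch an unset name or a leftover placeholder before the load step runs. This is a sketch only; the function name `check_container_name` is an illustrative assumption.

```bash
# Hypothetical guard: fail if RCLONE_CONTAINER is empty or still contains
# the YOURNETID placeholder.
check_container_name() {
    case "${RCLONE_CONTAINER:-}" in
        "")          echo "RCLONE_CONTAINER is not set" >&2; return 1 ;;
        *YOURNETID*) echo "Replace YOURNETID in RCLONE_CONTAINER" >&2; return 1 ;;
        *)           echo "Using container: $RCLONE_CONTAINER" ;;
    esac
}

# Example: only continue when the name looks valid.
# check_container_name && echo "ready to run load-data"
```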

Then run the load data pipeline:
```bash
docker compose -f docker-compose-etl-s3.yaml run load-data
```

To load sharded data, use:
```bash
docker compose -f docker-compose-etl-s3.yaml run shard-data
```

> **Note**: Replace `YOURNETID` with your actual NetID in the container name.
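
The three numbered stages can also be chained so that a failure stops the run. The wrapper below is a minimal sketch under stated assumptions: the function names, the `COMPOSE_FILE_ETL` variable, and the `DRY_RUN` switch are all hypothetical, and it assumes `RCLONE_CONTAINER` has already been exported as shown above.

```bash
# Hypothetical wrapper around the commands above. Set DRY_RUN=1 to print the
# commands instead of running them.
COMPOSE_FILE_ETL="docker-compose-etl-s3.yaml"

run_stage() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "docker compose -f $COMPOSE_FILE_ETL run $1"
    else
        docker compose -f "$COMPOSE_FILE_ETL" run "$1"
    fi
}

# Run extract, transform, and load in order; stop at the first failure.
run_etl() {
    for stage in extract-data transform-data load-data; do
        run_stage "$stage" || { echo "Stage $stage failed" >&2; return 1; }
    done
}

# Preview the commands without starting any containers.
DRY_RUN=1 run_etl
```

Calling `run_etl` without `DRY_RUN=1` runs the real commands.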

## Step 3: Cleanup and Troubleshooting

If something goes wrong and you need to clean everything up and start from scratch, follow these steps in order:

### 1. Remove All Containers Using the food11 Volume
```bash
docker ps -a --filter volume=food11-etl-s3_food11 -q | xargs -r docker rm -f
```
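
This command lists the IDs of all containers attached to the volume and passes them to `docker rm -f`. The `-r` flag keeps `xargs` from running the command at all when the list is empty. The sketch below shows the same pattern with `echo` standing in for Docker, so it is safe to run anywhere; the IDs are made up.

```bash
# With input: xargs appends the IDs to the command.
printf 'abc123\ndef456\n' | xargs -r echo docker rm -f
# prints: docker rm -f abc123 def456

# With empty input: -r means the command is not run, so nothing is printed.
printf '' | xargs -r echo docker rm -f
```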

### 2. Bring Down Compose and Remove Volumes
```bash
docker compose -f docker-compose-etl-s3.yaml down -v
```

### 3. Remove Any Remaining food11 Volumes
```bash
docker volume ls -q --filter name=food11 | xargs -r docker volume rm
```

### 4. Remove All Stopped Containers
```bash
docker container prune -f
```

### 5. Delete Data from Object Store (Optional)
```bash
rclone delete rclone_s3:YOUR_CONTAINER_NAME --rmdirs
```

> **Warning**: Step 5 permanently deletes the data in your object store container. Run it only if you are certain you no longer need that data.
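
Because the deletion cannot be undone, it can help to preview it first. rclone's `--dry-run` flag reports what would be deleted without removing anything. The sketch below wraps that preview in a confirmation prompt; the function name `confirm_and_delete` and the `RCLONE_BIN` override are illustrative assumptions, not part of rclone.

```bash
# Hypothetical helper: preview the deletion, then require typing the
# container name before anything is removed.
# RCLONE_BIN is an illustrative override (defaults to rclone).
confirm_and_delete() {
    target="rclone_s3:$1"
    # Preview only: nothing is deleted by this call.
    "${RCLONE_BIN:-rclone}" delete "$target" --rmdirs --dry-run || return 1
    printf 'Type the container name to delete %s: ' "$target"
    read -r answer
    if [ "$answer" = "$1" ]; then
        "${RCLONE_BIN:-rclone}" delete "$target" --rmdirs
    else
        echo "Aborted; nothing deleted."
        return 1
    fi
}

# Example: confirm_and_delete YOUR_CONTAINER_NAME
```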

## Additional Resources

- [rclone Documentation](https://rclone.org/docs/)
- [Chameleon Cloud Documentation](https://chameleoncloud.readthedocs.io/)
- [Docker Compose Documentation](https://docs.docker.com/compose/)

## Next Steps

After completing this pipeline setup, you can:
1. Monitor your data transfer progress
2. Verify data integrity after transfer
3. Set up automated data synchronization schedules
4. Configure additional object storage backends if needed