The example below runs the command
```
./sagmaker_submit_dir/run_scraper.py --from-id 8627380 --to-id 8627391 --local-dir danbooru_downloads --upload-dir s3://dataset-ingested/danbooru
```
on a SageMaker ml.m5.xlarge instance. The script downloads images from Danbooru and uploads them to an S3 bucket.
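The contents of `run_scraper.py` are not shown in this notebook. As a rough sketch of how an entry point could read the four flags from the command above, the snippet below uses `argparse`. The function name `parse_args`, the default for `--local-dir`, and the choice of which flags are required are assumptions made for illustration, not the actual script.

```python
import argparse


def parse_args(argv=None):
    """Parse the four flags used in the command above (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Scrape a range of post IDs.")
    parser.add_argument("--from-id", type=int, required=True,
                        help="First post ID in the range.")
    parser.add_argument("--to-id", type=int, required=True,
                        help="Last post ID in the range.")
    # Assumed default; the real script may require this flag instead.
    parser.add_argument("--local-dir", type=str, default="danbooru_downloads",
                        help="Directory for downloaded files.")
    parser.add_argument("--upload-dir", type=str, required=True,
                        help="S3 URI to upload the downloaded files to.")
    # parse_known_args ignores flags this sketch does not define
    # instead of exiting with an error.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Example: parse the same arguments as the command above.
args = parse_args([
    "--from-id", "8627380",
    "--to-id", "8627391",
    "--local-dir", "danbooru_downloads",
    "--upload-dir", "s3://dataset-ingested/danbooru",
])
print(args.from_id, args.to_id, args.local_dir, args.upload_dir)
```

Note that `argparse` converts dashes in flag names to underscores, so `--from-id` is read back as `args.from_id`.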

In [6]:
from sagemaker.pytorch import PyTorch

def launch_scraper_job(start_id: int, end_id: int, local_dir: str, upload_dir: str, 
                       instance_type: str = "ml.m5.xlarge", max_run: int = 7200):
    """
    Launch a single SageMaker job to run the scraper script.

    Args:
        start_id (int): Starting post ID for the scraper.
        end_id (int): Ending post ID for the scraper.
        local_dir (str): Local directory to store scraped data.
        upload_dir (str): S3 bucket URI to upload the scraped data.
        instance_type (str): SageMaker instance type to use.
        max_run (int): Maximum run time in seconds (default: 2 hours).
    """
    # Define hyperparameters
    hyperparameters = {
        'from-id': start_id,
        'to-id': end_id,
        'local-dir': local_dir,
        'upload-dir': upload_dir,
    }

    # Estimator configuration
    estimator = PyTorch(
        entry_point='run_scraper.py',
        source_dir='/home/ubuntu/danbooru-scraper/notebooks/sagmaker_submit_dir',
        role='sagemaker_training_execution_role',  # Replace with your SageMaker role ARN
        instance_count=1,
        instance_type=instance_type,
        image_uri='763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.1-cpu-py310',
        hyperparameters=hyperparameters,
        max_run=max_run,
        volume_size=50,  # Adjust based on your storage needs
    )

    # Job name for tracking
    job_name = f"scraper-job-{start_id}-{end_id}"

    # Launch the job
    print(f"Launching job: {job_name}")
    estimator.fit(wait=False, job_name=job_name)
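    # Note: with wait=False, fit() returns as soon as the job is submitted,
    # so this message does not mean the job has finished running.
    print(f"Job {job_name} completed.")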

In [7]:
launch_scraper_job(
    start_id=8627380,
    end_id=8627391,
    local_dir='danbooru_downloads',
    upload_dir='s3://dataset-ingested/danbooru'
)

Launching job: scraper-job-8627380-8627391


2024-12-28 07:04:57 Starting - Starting the training job...
2024-12-28 07:05:12 Starting - Preparing the instances for training...
2024-12-28 07:05:48 Downloading - Downloading the training image......Job scraper-job-8627380-8627391 completed.
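Because the job is launched with `wait=False`, the message above is printed while the job is still starting. One possible way to check whether the job has actually finished is to poll its status through the SageMaker API. The sketch below is an illustration only: the helper name `wait_for_job`, the polling interval, and the poll limit are assumptions, and the `client` argument is expected to be a `boto3` SageMaker client.

```python
import time

# Statuses after which a training job no longer changes state.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_job(client, job_name, poll_seconds=30, max_polls=240):
    """Poll a training job until it reaches a terminal status (hypothetical helper).

    `client` is any object exposing describe_training_job(TrainingJobName=...),
    such as the client returned by boto3.client("sagemaker").
    """
    for _ in range(max_polls):
        description = client.describe_training_job(TrainingJobName=job_name)
        status = description["TrainingJobStatus"]
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job {job_name} did not finish after {max_polls} polls")


# Usage (requires AWS credentials; the region is assumed from the image URI):
#   import boto3
#   client = boto3.client("sagemaker", region_name="us-west-2")
#   print(wait_for_job(client, "scraper-job-8627380-8627391"))
```

Taking the client as a parameter keeps the helper easy to try out with a stub object before pointing it at a real account.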
