# Fine-tuning and Serving a Large Language Model (LLM) with Ray on Amazon EKS

Language models have revolutionized the field of natural language processing, enabling applications such as chatbots, sentiment analysis, and content generation. Fine-tuning these models for specific tasks and deploying them at scale is a crucial aspect of leveraging their power effectively.

In this tutorial, we will explore the process of fine-tuning a pre-trained LLM, optimizing it for your specific needs, and then deploying it on Amazon EKS. Ray, a distributed computing framework, will be our ally in managing the complexity of training and serving such models efficiently.

Whether you're a data scientist looking to enhance your NLP projects or an engineer tasked with deploying AI-powered applications in production, this notebook will guide you through the entire journey.

## Requirements

### Installing core components

In [None]:
# Restart kernel after install
! pip install --upgrade pip
! pip install -U "ray[air]" "boto3==1.28.47" "ray==2.6.3" "protobuf==3.19.6" "jupyter-server==1.7" "NumPy==1.22.0" "ipywidgets>=8"
! pip install "datasets==2.14.5" "evaluate==0.4.0" "einops==0.6.1" "accelerate==0.23.0" "transformers>=4.33.1" "torch==2.0.1" "deepspeed==0.9.3" "peft==0.4.0" "bitsandbytes==0.41.1" "loralib==0.1.2" "xformers==0.0.21" 
! pip install pandas --upgrade 

### Global variables

After installing the requirements, we define some global variables.

In [None]:
# Fine-tuning variables
model_name = "tiiuae/falcon-7b" # The pre-trained model you are going to utilize
dataset_name = "gbharti/finance-alpaca" # The dataset we are going to utilize for fine-tuning
bucket = "<REPLACE HERE WITH YOUR BUCKET NAME CREATED BY TERRAFORM>" # Where you are going to store your dataset
use_gpu = True # Enable GPU for fine-tuning in Ray cluster
num_workers = 4 # Number of workers to use for Ray cluster
cpus_per_worker = 8 # Number of CPUs per worker to use for Ray cluster
storage_path=f"s3://{bucket}/checkpoints/" # Since this example runs with multiple nodes, we need to persist checkpoints and other outputs to some external storage for access after training has completed
ray_train_address = "ray-cluster-train-kuberay-head-svc.ray-cluster-train.svc.cluster.local" # Internal Ray Cluster training address powered by CoreDNS
ray_serve_address = "ray-svc-non-finetuned-head-svc.ray-svc-non-finetuned.svc.cluster.local" # Internal Ray Cluster serving address powered by CoreDNS
train_dependencies = [
    "awscli",
    "datasets==2.14.5",
    "evaluate==0.4.0",
    "einops==0.6.1",
    "accelerate==0.23.0",
    "transformers>=4.33.1",
    "torch==2.0.1",
    "deepspeed==0.9.3",
    "peft==0.4.0",
    "bitsandbytes==0.41.1",
    "loralib==0.1.2",
    "xformers==0.0.21"
]

You can change the `model_name` variable to any pre-trained model available at [Hugging Face](https://huggingface.co/models).

Some examples are:
- [Llama2](https://huggingface.co/docs/transformers/main/en/model_doc/llama2)
- [Falcon](https://huggingface.co/docs/transformers/main/en/model_doc/falcon)
- [GPT-J](https://huggingface.co/docs/transformers/main/en/model_doc/gptj)
- [Falcon Lite](https://huggingface.co/amazon/FalconLite)
- [Light GPT](https://huggingface.co/amazon/LightGPT)

You can set `dataset_name` to your own dataset. The dataset structure depends on the task and the model you are trying to fine-tune. You can browse available datasets at [Hugging Face](https://huggingface.co/datasets).

For example:
- [Finance Alpaca](https://huggingface.co/datasets/gbharti/finance-alpaca)
- [Code Instructions](https://huggingface.co/datasets/iamtarun/code_instructions_120k_alpaca)
- [Falcon RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)
- [Mental Health](https://huggingface.co/datasets/Amod/mental_health_counseling_conversations)

### Connect to Ray cluster deployed in Amazon EKS

In [None]:
import ray

In [None]:
# Connecting Ray client with the cluster
ray.shutdown()
ray.init(
    address=f"ray://{ray_train_address}:10001",
    runtime_env={
        "pip": train_dependencies
    }
)

## Fine-tuning LLM

Fine-tuning a Large Language Model (LLM) refers to the process of taking a pre-trained language model, like Falcon 7B or its variants, and further training it on a specific dataset or task to make it more specialized and useful for that particular task. This process is commonly used in natural language processing (NLP) to adapt a general-purpose language model to specific applications.

Here are the key steps involved in fine-tuning an LLM:

1. **Pre-trained Model**: Start with a pre-trained LLM that has been trained on a large corpus of text data. These models are typically trained on a massive scale and have learned a wide range of language patterns and knowledge.

2. **Task Definition**: Define the specific NLP task or application you want to use the model for. This could be sentiment analysis, text classification, language translation, question answering, etc.

3. **Data Preparation**: Collect or create a dataset that is relevant to your task. This dataset should include labeled examples for supervised tasks (e.g., pairs of input and corresponding output for translation) or unstructured text data for tasks like language modeling or text generation.

4. **Fine-Tuning Process**:
   - Initialize the pre-trained LLM with its weights and parameters.
   - Train the model on your task-specific dataset.
   - During fine-tuning, the model adjusts its weights based on the new dataset while retaining much of the knowledge it gained during pre-training.
   - The fine-tuning process typically involves multiple training epochs, and you can monitor performance on a validation dataset to determine when to stop training.

5. **Evaluation**: After fine-tuning, assess the model's performance on a separate evaluation dataset. This helps you gauge how well the model has adapted to the specific task.

Fine-tuning allows you to leverage the capabilities of large pre-trained models and transfer their general language understanding to specific tasks, saving significant time and resources compared to training a model from scratch. It's a common practice in NLP to achieve state-of-the-art results on various tasks and domains.
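Step 4 mentions monitoring a validation dataset to decide when to stop training. A minimal sketch of one common rule (patience-based early stopping) is shown below; the function name, the `patience` default, and the loss values are illustrative assumptions, not part of this notebook's training script.

```python
def should_stop(val_losses, patience=2, min_delta=0.0):
    """Return True once validation loss has not improved for `patience` evaluations."""
    if len(val_losses) <= patience:
        # Not enough history yet to compare against
        return False
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    # Stop when the recent evaluations did not beat the earlier best by min_delta
    return recent_best > best_before - min_delta


# Hypothetical validation losses, one per evaluation
history = [2.10, 1.85, 1.80, 1.82, 1.81]
print(should_stop(history))
```

With these made-up numbers the last two evaluations do not beat the earlier best of 1.80, so the rule suggests stopping.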

### Testing general chat model without specific task fine-tuning

Pre-trained models are good at generating text but are not tuned for chat. In this first test we are going to use [Falcon-7B-Instruct](https://huggingface.co/tiiuae/falcon-7b-instruct), a model fine-tuned from Falcon-7B on a [general dataset](https://huggingface.co/datasets/tiiuae/falcon-refinedweb).

In [None]:
from ray import serve
import pandas as pd
from starlette.requests import Request

In [None]:
@serve.deployment(ray_actor_options={"num_gpus": 1})
class PredictDeployment:
    def __init__(self, model_id: str):
        from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
        import torch
        
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            trust_remote_code=True,
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
        ).cuda()
        
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)

        self.config = GenerationConfig(
            temperature=0.7,
            top_p=0.9,
            num_beams=4,
            include_prompt_in_result=False,
        )

    def generate(self, prompt, params):
        inputs = self.tokenizer(prompt, return_tensors="pt")
        input_ids = inputs.input_ids.to(self.model.device)
        self.config.temperature = params["temperature"]
        self.config.top_p = params["top_p"]
        self.config.num_beams = params["num_beams"]

        generation_output = self.model.generate(
            input_ids,
            generation_config=self.config,
            max_new_tokens=params['max_tokens'],
            return_dict_in_generate=True,
            output_scores=False
        )
        
        answer=[]
        for seq in generation_output.sequences:
            output = self.tokenizer.decode(seq, skip_special_tokens=True)
            answer.append(output.split("### Answer:")[-1].strip())
        
        return answer[0]

    async def __call__(self, http_request: Request) -> str:
        json_request: dict = await http_request.json()
        prompt = json_request["prompt"]
        params = json_request["params"]
        return self.generate(prompt, params)

In [None]:
serve.start(detached=True)

> The next step may show a Warning message about different HTTP configurations for field ['location']. You can ignore this message.

In [None]:
serve.run(PredictDeployment.bind(model_id="tiiuae/falcon-7b"))
serve.get_deployment("default_PredictDeployment").url

Execute the `port-forward` command below to expose the Ray Cluster train dashboard. 

```bash
kubectl port-forward -n ray-cluster-train svc/ray-cluster-train-kuberay-head-svc 8265:8265
```

Then open `http://localhost:8265` in a new browser tab to take a look at the Ray Dashboard.

Go back to your local terminal, open a new terminal window, and test the LLM. First run a new `port-forward` command:

```bash
kubectl port-forward -n ray-cluster-train --address 0.0.0.0 svc/ray-cluster-train-kuberay-head-svc 8000:8000
```

Now, in another terminal, run the following `curl` command and wait a few minutes for the response. The answer may be repetitive because the model has not yet been fine-tuned for this task.

```bash
curl -X POST -H "Content-Type: application/json" -d '{
    "prompt": "Why do I need an emergency fund if I already have investments?",
    "params": {
        "temperature": 0.7,
        "top_p": 0.9,
        "max_tokens": 256,
        "num_beams": 4
    }
}' http://127.0.0.1:8000/

```

- **Question**: "Why do I need an emergency fund if I already have investments?"
- **Answer**: "You need an emergency fund because you don’t know when you’ll need it, and you don’t know how much you’ll need."
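The same request can also be built from Python with only the standard library. This is a sketch that assumes the `port-forward` on `127.0.0.1:8000` from the previous step is still running; `build_request` is a hypothetical helper name, and the actual network call is left commented out.

```python
import json
import urllib.request


def build_request(prompt, url="http://127.0.0.1:8000/", temperature=0.7,
                  top_p=0.9, max_tokens=256, num_beams=4):
    """Build a POST request with the JSON body that the deployment's __call__ expects."""
    payload = {
        "prompt": prompt,
        "params": {
            "temperature": temperature,
            "top_p": top_p,
            "max_tokens": max_tokens,
            "num_beams": num_beams,
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request("Why do I need an emergency fund if I already have investments?")

# Uncomment to send the request once the port-forward is active:
# with urllib.request.urlopen(request, timeout=600) as response:
#     print(response.read().decode("utf-8"))
```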

> The next step may show an Info message about updated deployment replicas.

In [None]:
serve.delete("default")
serve.shutdown()

### Fine-tuning model with LoRA and specific dataset

Fine-tuning lets us take a pre-trained foundation model and adapt it to a specific task or domain by training it on domain-specific data.

Full fine-tuning can be expensive because it updates every parameter of the model, which means billions of parameters for a model like Falcon-7B. It requires a substantial amount of training data, extensive infrastructure, and considerable effort. It also carries a risk of catastrophic forgetting, where the model loses knowledge it acquired during pre-training.

Parameter-Efficient Fine-Tuning (PEFT) methods reduce these resource and cost requirements by training only a small number of parameters while keeping most of the pre-trained weights frozen. This makes PEFT a good fit for domain-specific adaptation: it preserves the knowledge in the pre-trained model while adapting it efficiently to the target task. Of the available approaches, Low-Rank Adaptation (LoRA) and QLoRA are among the most widely used.

![LoRA and QLoRA](https://github.com/aws-samples/gen-ai-on-eks/blob/main/notebooks/fine-tuning-methods.png?raw=true)
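To make the parameter savings concrete, the sketch below counts the weights LoRA adds to one linear layer: two low-rank matrices, one of shape `rank x d_in` and one of shape `d_out x rank`. The layer shapes are assumptions about Falcon-7B's `query_key_value` projection (hidden size 4544, output size 4672, 32 layers) and should be checked against the model config; the rank of 8 matches the `LoraConfig` used later in this notebook.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters LoRA adds next to a frozen d_out x d_in weight matrix."""
    return rank * (d_in + d_out)


# Assumed Falcon-7B shapes for the query_key_value projection (verify in the model config)
hidden_size = 4544
qkv_out = 4672
num_layers = 32
rank = 8

per_layer = lora_param_count(hidden_size, qkv_out, rank)
total_lora = per_layer * num_layers
total_frozen = hidden_size * qkv_out * num_layers

print(f"LoRA parameters per layer: {per_layer}")   # 73728
print(f"LoRA parameters in total:  {total_lora}")  # 2359296
print(f"Share of the adapted projections: {100 * total_lora / total_frozen:.2f}%")
```

Under these assumptions, only about 2.4 million parameters are trained, a small fraction of the projection weights they adapt.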

#### Checking dataset

In [None]:
import pandas as pd
from datasets import load_dataset
from datasets import Dataset, DatasetDict

In [None]:
dataset = load_dataset(dataset_name, split="train")

# Convert the dataset to a pandas DataFrame
df = pd.DataFrame(dataset)

# Display the first few rows of the DataFrame
df.head()

In [None]:
# Rename input and output
df.rename(columns={'output': 'Response', 'instruction': 'Context'}, inplace=True)

# Drop other columns
df.drop(columns=df.columns.difference(['Response', 'Context']), inplace=True)

# Reorder
df = df[["Context", "Response"]]

df.head()

In [None]:
# Save just 1k samples for demo purpose
df = df.sample(n=1000, random_state=0)
df.shape
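The training script later in this notebook combines the `Context` and `Response` columns into a single instruction-style text per row. The sketch below shows the idea on one hand-written row; the row content and the `format_example` helper are hypothetical, and the exact prompt wording in the training script differs.

```python
def format_example(row):
    """Join one Context/Response pair into a single training text."""
    return (
        "### Instruction:\n"
        f"{row['Context']}\n"
        "### Response:\n"
        f"{row['Response']}"
    )


# Hypothetical row, shaped like the DataFrame rows above
row = {
    "Context": "What is an emergency fund?",
    "Response": "Cash set aside to cover unexpected expenses.",
}
print(format_example(row))
```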

## Start fine-tuning with Ray's Job Submission API

The reason for using Ray's Job Submission API instead of calling `trainer.fit()` directly in a Jupyter Notebook is that the latter doesn't let you see the logs within the notebook interface. Using the Job Submission API gives you more control over job monitoring and log inspection, which is especially useful for debugging and real-time monitoring of training progress.

In [None]:
# close connection to Ray cluster
ray.shutdown()

In [None]:
import boto3
from datetime import datetime
from ray.job_submission import JobSubmissionClient

We are going to create a training script for job submission.

> **Make sure to change the bucket variable with the correct bucket name**
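A quick guard can catch an unreplaced placeholder before the job is submitted. The sketch below checks a simplified subset of the S3 bucket naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit). The helper name and the example bucket name are hypothetical.

```python
import re

# Simplified subset of the S3 bucket naming rules
BUCKET_PATTERN = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def is_valid_bucket_name(name):
    """Return True if the name looks like a real bucket name rather than a placeholder."""
    return bool(BUCKET_PATTERN.match(name))


placeholder = "<REPLACE HERE WITH YOUR BUCKET NAME CREATED BY TERRAFORM>"
print(is_valid_bucket_name(placeholder))            # False
print(is_valid_bucket_name("my-genai-bucket-123"))  # True (hypothetical name)
```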

In [None]:
%%writefile train_script.py
import os
import ray
import torch
import evaluate
import numpy as np
import pandas as pd
from datasets import load_dataset
from datasets import Dataset
from ray.data.preprocessors import BatchMapper, Chain
from transformers import AutoTokenizer, DataCollatorForLanguageModeling, AutoModelForCausalLM, BitsAndBytesConfig, Trainer, TrainingArguments
from ray.train.huggingface import TransformersTrainer
from ray.air import RunConfig, ScalingConfig, session
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, prepare_model_for_int8_training

# GLOBAL VARIABLES DEFINITION, those can be captured as parameters 
model_name = "tiiuae/falcon-7b" # The pre-trained model you are going to utilize
dataset_name = "gbharti/finance-alpaca" # The dataset we are going to utilize for fine-tuning
bucket = "<REPLACE HERE WITH YOUR BUCKET NAME CREATED BY TERRAFORM>" # Where you are going to store your dataset
use_gpu = True # Enable GPU for fine-tuning in Ray cluster
num_workers = 4 # Number of workers to use for Ray cluster
cpus_per_worker = 8 # Number of CPUs per worker to use for Ray cluster
storage_path=f"s3://{bucket}/checkpoints/" # Since this example runs with multiple nodes, we need to persist checkpoints and other outputs to some external storage for access after training has completed
ray_train_address = "auto"
os.environ['CUDA_HOME'] = "/usr/local/cuda/"  # Adjust this path to your CUDA installation
train_dependencies = [
    "datasets==2.14.5",
    "evaluate==0.4.0",
    "einops==0.6.1",
    "accelerate==0.23.0",
    "transformers==4.33.1",
    "torch==2.0.1",
    "deepspeed==0.9.3",
    "peft==0.4.0",
    "bitsandbytes==0.41.1",
    "loralib==0.1.2",
    "xformers==0.0.21"
]

# Connecting Ray client with the cluster
ray.init(
    address=ray_train_address,
    runtime_env={
        "pip": train_dependencies
    }
)

def load_prepare_dataset():
    print("Loading dataset")
    dataset = load_dataset(dataset_name, split="train")

    # Convert the dataset to a pandas DataFrame
    df = pd.DataFrame(dataset)
    # Rename input and output
    df.rename(columns={'output': 'Response', 'instruction': 'Context'}, inplace=True)
    # Drop other columns
    df.drop(columns=df.columns.difference(['Response', 'Context']), inplace=True)
    # Reorder
    df = df[["Context", "Response"]]
    # Save just 1k samples for demo purpose
    df = df.sample(n=1000, random_state=0)
    # Display the first few rows of the DataFrame
    print(df.head(10))

    dataset = Dataset.from_pandas(df)    
    dataset_prompts = {}
    dataset_prompts['text'] = []

    def generate_prompt(data_point):
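        # Alpaca-style prompt assembled line by line, so that no lint comment
        # or source indentation leaks into the training text
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n"
            "### Instruction:\n"
            f"{data_point['Context']}\n"
            "### Response:\n"
            f"{data_point['Response']}"
        )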

        
    for data_point in dataset:
        prompt = generate_prompt(data_point)
        dataset_prompts['text'].append(prompt)

    # Transform to Ray dataset format
    dataset_prompts_df = pd.DataFrame.from_dict(dataset_prompts)
    dataset_ray = ray.data.from_pandas(dataset_prompts_df)

    return dataset_ray

def prepare_batch_mapper():
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    def preprocess_function(batch):
        ret = tokenizer(list(batch["text"]), padding=True, truncation=True)
        return dict(ret)

    batch_mapper = BatchMapper(preprocess_function, batch_format="pandas")

    return batch_mapper

def trainer_init_per_worker(train_dataset, eval_dataset=None, **train_ray_config):
    # Use the actual number of CPUs assigned by Ray
    os.environ["OMP_NUM_THREADS"] = str(
        session.get_trial_resources().bundles[-1].get("CPU", 1)
    )

    # Enable tf32 for better performance
    torch.backends.cuda.matmul.allow_tf32 = True

    # Loading model
    print("Loading model")
    # TODO: QLoRA with DeepSpeed
    # bnb_config = BitsAndBytesConfig(
    #    load_in_4bit=True,
    #    bnb_4bit_use_double_quant=True,
    #    bnb_4bit_quant_type="nf4",
    #    bnb_4bit_compute_dtype=torch.bfloat16,
    # )

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        # device_map="auto",
        trust_remote_code=True,
        load_in_8bit=True,
        # quantization_config=bnb_config
    )
    model.config.use_cache = False
    
    # Configuring LoRA
    print("Configuring LoRA")
    # model = prepare_model_for_kbit_training(model)
    model = prepare_model_for_int8_training(model)
    
    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        target_modules=["query_key_value"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )

    model = get_peft_model(model, lora_config)

    # Print trainable parameters
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"Trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )
    
    # Training config
    deepspeed = {
        "fp16": {
            "enabled": "auto",
            "initial_scale_power": 8,
        },
        "bf16": {"enabled": "auto"},
        "optimizer": {
            "type": "AdamW",
            "params": {
                "lr": "auto",
                "betas": "auto",
                "eps": "auto",
            },
        },
        "scheduler": {
            "type": "WarmupLR",
            "params": {
                "warmup_min_lr": "auto",
                "warmup_max_lr": "auto",
                "warmup_num_steps": "auto"
            }
        },
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": True,
            },
            "offload_param": {
                "device": "cpu",
                "pin_memory": True,
            },
            "overlap_comm": True,
            "contiguous_gradients": True,
            "reduce_bucket_size": "auto",
            "stage3_prefetch_bucket_size": "auto",
            "stage3_param_persistence_threshold": "auto",
            "gather_16bit_weights_on_model_save": True,
            "round_robin_gradients": True,
        },
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "steps_per_print": 10,
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "wall_clock_breakdown": False,
    }

    # Preparing training arguments
    batch_size = train_ray_config.get("batch_size", 1)
    epochs = train_ray_config.get("epochs", 1)
    warmup_steps = train_ray_config.get("warmup_steps", 0)
    learning_rate = train_ray_config.get("learning_rate", 0.00002)
    weight_decay = train_ray_config.get("weight_decay", 0.01)

    training_args = TrainingArguments(
        output_dir="output",
        per_device_train_batch_size=batch_size,
        logging_steps=10,
        save_strategy="steps",
        learning_rate=learning_rate,
        weight_decay=weight_decay,
        warmup_steps=warmup_steps,
        num_train_epochs=epochs,
        push_to_hub=False,
        disable_tqdm=True,
        bf16=False,
        fp16=False,
        gradient_checkpointing=True,
        deepspeed=deepspeed,
    )

    # Trainer object
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    
    return trainer

# Load and prepare dataset
dataset = load_prepare_dataset()
batch_mapper = prepare_batch_mapper()

trainer = TransformersTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    trainer_init_config={
        "batch_size": 1,
        "epochs": 1,
    },
    scaling_config=ScalingConfig(
        num_workers=num_workers,
        use_gpu=use_gpu,
        resources_per_worker={"GPU": 1, "CPU": cpus_per_worker},
    ),
    datasets={"train": dataset},
    preprocessor=batch_mapper,
    run_config=RunConfig(storage_path=storage_path),
)

# Train
ft_model = trainer.fit()

In [None]:
s3_client = boto3.client("s3")
s3_client.upload_file("./train_script.py", bucket, "scripts/train_script.py")

In [None]:
ray_client = JobSubmissionClient(f"http://{ray_train_address}:8265")

submission_id = ray_client.submit_job(
    # Entrypoint shell command to execute
    entrypoint=(
        f"rm -rf train_script.py && aws s3 cp s3://{bucket}/scripts/train_script.py train_script.py || true;"
        "chmod +x train_script.py && python train_script.py"
    ),
    runtime_env={
        "pip": train_dependencies
    }
)

In [None]:
ray_client.get_job_info(submission_id)
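Instead of re-running the cell above by hand, the job can be polled until it reaches a terminal state. This sketch assumes the client exposes `get_job_status` and `get_job_logs`, as Ray's `JobSubmissionClient` does; `wait_for_job`, the polling interval, and the timeout are illustrative choices.

```python
import time

# States after which the job will not change anymore
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_job(client, submission_id, poll_seconds=30, timeout_seconds=3600):
    """Poll the job status until it is terminal, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = client.get_job_status(submission_id)
        # Works for both enum-like statuses and plain strings
        state = str(getattr(status, "value", status))
        print(f"Job {submission_id}: {state}")
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {submission_id} still {state} after {timeout_seconds}s")
        time.sleep(poll_seconds)


# Usage with the client created above:
# final_state = wait_for_job(ray_client, submission_id)
# print(ray_client.get_job_logs(submission_id))
```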

Training takes about 5 minutes (training time only) with 1,000 samples from the dataset. The LoRA results are stored in S3 as a checkpoint. For a production workload, you can increase the number of **epochs** and the size of the dataset.
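A rough way to reason about how training length grows with dataset size and epochs is to estimate the number of optimizer steps. The sketch below assumes plain data parallelism, where each worker processes a distinct shard of the data, and no gradient accumulation unless stated; the helper name is hypothetical.

```python
import math


def optimizer_steps(num_samples, num_workers, per_device_batch_size,
                    grad_accum_steps=1, epochs=1):
    """Estimate optimizer steps for a data-parallel run."""
    global_batch = num_workers * per_device_batch_size * grad_accum_steps
    return math.ceil(num_samples / global_batch) * epochs


# Values used in this notebook: 1,000 samples, 4 workers, batch size 1, 1 epoch
print(optimizer_steps(1000, 4, 1))  # 250
```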

## Exploring Ray Node scaling

We are using Karpenter for node scaling. Our training cluster has only one worker node running, but we have enabled `enableInTreeAutoscaling: "True"` in the Ray cluster Helm values, which means Ray will scale out to the number of workers requested by the `num_workers` variable (for example, `num_workers = 10`). Open a new terminal and execute the following commands to explore the scaling:

Get Ray Cluster Pods:

```bash
kubectl get pods -n ray-cluster-train
```

With `num_workers = 10`, you should see `9` pods in `Pending` state. This means the Kubernetes scheduler has not found any available node, so `Karpenter` has started to scale out. Verify node provisioning:

```bash
kubectl get nodes -l provisioner=gpu-train
```

We use the `NVIDIA GPU Operator` to configure each node's dependencies instead of installing components in the AMI itself. Take a look at the GPU Operator pods, which are provisioned based on the number of nodes:

```bash
kubectl get pods -n gpu-operator
```

> It can take several minutes until the nodes are ready to start the fine-tuning process.

## Getting model output folder

Now let's get the output folder where Ray saved the fine-tuned model. This folder will be used in the serving script.

In [None]:
import boto3

# Initialize S3 client
s3_client = boto3.client('s3')

# The root folder where your search starts
root_folder = 'checkpoints/'

# The specific folder you're looking for
target_folder = 'checkpoint_000000/'

def search_folder_in_s3_bucket(bucket, current_folder):
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=current_folder, Delimiter='/'):
        if 'CommonPrefixes' in page:
            for prefix in page['CommonPrefixes']:
                folder_key = prefix['Prefix']
                if folder_key.endswith(target_folder):
                    print(f"MODEL_PATH: {folder_key}")
                    return True
                # This is a folder, search within it
                if search_folder_in_s3_bucket(bucket, folder_key):
                    return True
    return False

# Run the function
if not search_folder_in_s3_bucket(bucket, root_folder):
    print(f"Folder {target_folder} not found under folder {root_folder}.")

## Serve fine-tuned model with Ray Operator

In [None]:
%%writefile serve_script.py
import os
import boto3
import pandas as pd
import ray
from ray import serve
from starlette.requests import Request

# train_dependencies = [
#     "awscli",
#     "datasets==2.14.5",
#     "evaluate==0.4.0",
#     "einops==0.6.1",
#     "accelerate==0.23.0",
#     "transformers==4.33.1",
#     "torch==2.0.1",
#     "deepspeed==0.9.3",
#     "peft==0.4.0",
#     "bitsandbytes==0.41.1",
#     "loralib==0.1.2",
#     "xformers==0.0.21"
# ]
# ray.init(
#     address="auto",
#     namespace="serve",
#     runtime_env={
#         "pip": train_dependencies
#     }
# )
# serve.start(detached=True)

@serve.deployment(ray_actor_options={"num_gpus": 1})
class PredictDeployment:
    def __init__(self, model_id: str):
        import os
        import boto3
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
        from peft import PeftModel
        
        print("Downloading checkpoint from S3")
        # Initialize a Boto3 S3 client
        s3 = boto3.client('s3')

        # Specify the S3 bucket name and folder name you want to download
        bucket_name = '<REPLACE HERE WITH YOUR BUCKET NAME CREATED BY TERRAFORM>'
        folder_name = '<REPLACE HERE WITH THE OUTPUT FOLDER>'

        # Specify the local directory where you want to save the downloaded files
        local_directory = "local_model"

        # List objects in the S3 folder
        objects = s3.list_objects_v2(Bucket=bucket_name, Prefix=folder_name)

        # Ensure the local directory exists
        os.makedirs(local_directory, exist_ok=True)

        # Loop through the objects and download them
        for obj in objects.get('Contents', []):
            key = obj['Key']
            # Skip "folder" placeholder keys, which have no file name to download
            if key.endswith('/'):
                continue
            local_file_path = os.path.join(local_directory, os.path.basename(key))
            s3.download_file(bucket_name, key, local_file_path)
            print(f'Downloaded: {key} to {local_file_path}')
        print("Checkpoint downloaded")
        
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            trust_remote_code=True,
            load_in_8bit=True,
        )
        
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.tokenizer.pad_token = self.tokenizer.eos_token
        
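        # Load the LoRA adapter from the downloaded checkpoint; PEFT injects the
        # adapter layers into self.model in place, so the return value is not kept
        PeftModel.from_pretrained(self.model, local_directory)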

        self.config = GenerationConfig(
            temperature=0.7,
            top_p=0.9,
            num_beams=4,
            include_prompt_in_result=False,
        )

    def generate(self, prompt, params):
        inputs = self.tokenizer(prompt, return_tensors="pt")
        input_ids = inputs.input_ids.to(self.model.device)
        self.config.temperature = params["temperature"]
        self.config.top_p = params["top_p"]
        self.config.num_beams = params["num_beams"]

        generation_output = self.model.generate(
            input_ids,
            generation_config=self.config,
            max_new_tokens=params['max_tokens'],
            return_dict_in_generate=True,
            output_scores=False
        )
        
        answer=[]
        for seq in generation_output.sequences:
            output = self.tokenizer.decode(seq, skip_special_tokens=True)
            answer.append(output.split("### Answer:")[-1].strip())
        
        return answer[0]

    async def __call__(self, http_request: Request) -> str:
        json_request: dict = await http_request.json()
        prompt = json_request["prompt"]
        params = json_request["params"]
        return self.generate(prompt, params)

# Deploy
model_id = "tiiuae/falcon-7b"
deployment_finetuned = PredictDeployment.bind(model_id=model_id)
# serve.run(deployment_finetuned)

In [None]:
s3_client = boto3.client("s3")
s3_client.upload_file("./serve_script.py", bucket, "scripts/serve_script.py")

### Creating ZIP and uploading to Amazon S3

This ZIP file will be used with `Ray Operator` in the `Ray Service` manifest.

In [None]:
import boto3
s3_client = boto3.client("s3")

In [None]:
# TODO: zip & pre-signed url
from zipfile import ZipFile

with ZipFile('./falcon_7b_finetuned.zip', 'w') as zip_object:
    zip_object.write('./serve_script.py')

s3_client.upload_file("./falcon_7b_finetuned.zip", bucket, "falcon_7b_finetuned.zip")
presigned_url = s3_client.generate_presigned_url(
    'get_object',
    Params={'Bucket': bucket, 'Key': "falcon_7b_finetuned.zip"},
    ExpiresIn=3600
)

print("Pre-signed URL:", presigned_url)
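Before handing the URL to the `Ray Service` manifest, it can be worth confirming what the archive contains. The sketch below builds a throwaway archive in a temporary directory and lists its members; it assumes the manifest expects `serve_script.py` at the archive root, and `list_zip_members` is a hypothetical helper.

```python
import os
import tempfile
from zipfile import ZipFile


def list_zip_members(zip_path):
    """Return the file names stored in the archive."""
    with ZipFile(zip_path) as archive:
        return archive.namelist()


with tempfile.TemporaryDirectory() as workdir:
    # Stand-in for the real serving script
    script_path = os.path.join(workdir, "serve_script.py")
    with open(script_path, "w") as handle:
        handle.write("# placeholder serving script\n")

    zip_path = os.path.join(workdir, "falcon_7b_finetuned.zip")
    with ZipFile(zip_path, "w") as archive:
        # arcname keeps the script at the archive root
        archive.write(script_path, arcname="serve_script.py")

    members = list_zip_members(zip_path)

print(members)
```

Running `list_zip_members("./falcon_7b_finetuned.zip")` against the archive created above performs the same check on the real file.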

Now that we have finished our training and crafted the serving script, let's move on to Module 2 and learn how to use the Ray Service manifest from Ray Operator.

[**2. Serving finetuned model with contextual data using RayOperator**](https://github.com/aws-samples/gen-ai-on-eks/blob/main/modules/2-serving-finetuned-model.md)