# 2.8 Deploying Models
## 🚄 Introduction
After fine-tuning and evaluating the model, the Q&A bot is nearly complete. This lesson shows how to deploy the model on computing resources and turn it into an accessible application service.
It also introduces common ways to deploy models in the cloud and helps you choose the one that best fits your needs.

## 🍁 Course Objectives
Upon completing this lesson, you will be able to:
* Understand how to manually deploy a model
* Learn about common ways to deploy models in the cloud
* Choose the most appropriate way to deploy a model based on your own needs



## 1. Direct Model Invocation (No Deployment Required)

Model deployment is the process of transferring a trained AI model from the development environment to the production environment, enabling it to handle real-time data and provide services for practical applications, thereby serving real users and creating value.

Looking back at Sections 2.1 to 2.6, you invoked models many times (e.g., qwq-32b, qwen-plus) without ever deploying them. That is because those models are pre-deployed by Alibaba Cloud: they already run on Alibaba Cloud servers and can be accessed directly via API.

There are several advantages to directly invoking fully managed API services provided by service providers like Alibaba Cloud:

- **Direct invocation**: No need to deploy the model; simply call the API.
- **Pay-as-you-go billing**: Billed based on token usage, without worrying about resource consumption during model deployment.
- **No operations required**: You don’t need to worry about model deployment and maintenance, such as automatic scaling or model version upgrades—these tasks are handled by the model service provider.

This approach is highly suitable for **early-stage businesses or small-to-medium scale scenarios**, effectively reducing initial investment costs and avoiding wasted idle GPU resources.

**Note**: Direct model invocation is generally subject to "[rate limiting](https://help.aliyun.com/zh/model-studio/user-guide/rate-limit)." For example, the Model Studio API limits both the number of calls per minute (QPM) and the number of tokens consumed per minute (TPM). Requests that exceed these limits fail, and you must wait until the rate limit resets before calling again.
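
A common way to cope with rate limiting on the client side is to retry with exponential backoff. The following is a minimal sketch, not an official SDK feature: `RateLimitError` and `call_with_backoff` are hypothetical names introduced here for illustration, and `send_request` stands for whatever function performs your actual API call and raises `RateLimitError` when the service rejects the request.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error: raised by send_request when QPM/TPM limits are hit."""


def call_with_backoff(send_request, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call send_request(), retrying with exponential backoff on RateLimitError."""
    for attempt in range(max_retries + 1):
        try:
            return send_request()
        except RateLimitError:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            # Wait 1x, 2x, 4x, ... the base delay, plus random jitter so that
            # concurrent clients do not all retry at the same moment.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

The `sleep` parameter is injectable only to make the function easy to test; in normal use you would leave it at its default.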

Additionally, if the model has been fine-tuned, or the service provider does not yet support it, direct invocation may not meet your needs.

## 2. Deploying the Model in the Test Environment

In Section 2.7, you fine-tuned a small model (Qwen2.5-1.5B-Instruct) to keep accuracy high while speeding up inference. Next, you need to deploy this fine-tuned model so that it can provide services.

Deploying a model typically involves downloading the model, writing loading code, and publishing it as an application service with API access, which takes significant manual effort. vLLM, an open-source framework designed for large language model (LLM) inference, simplifies this process. It deploys a model with a few command-line parameters, and it improves inference speed and supports high-concurrency requests through memory optimization and caching strategies.

This section uses vLLM to load the model and start the service. The HTTP interface of this service is compatible with the OpenAI API, so you can quickly try out the model's inference capabilities by calling endpoints such as `/v1/chat/completions`.



### 2.1 Environment Preparation

The environment for this lesson must match the one used for fine-tuning in Section 2.7, so that model deployment runs in a GPU environment.

If you have been following the course in order, continue using the PAI-DSW instance launched in Section 2.7. If you are studying this section on its own, configure the environment according to the preparation steps in Section 2.7.

Open a new terminal window in the directory of this course, i.e., **/mnt/workspace/alibabacloud_acp_learning/ACP/p2_Build LLM Q&A System**.

<img src="https://img.alicdn.com/imgextra/i1/O1CN01TboZMt1pwFKnS6Gdx_!!6000000005424-2-tps-1460-1470.png" width="800">

Run `pwd` in the terminal window to check its current directory. If it is not the course directory, switch to it with the following command.

<style>
    table {
      width: 80%;
      margin: 20px; /* Center the table */
      border-collapse: collapse; /* Collapse borders for a cleaner look */
      font-family: sans-serif; 
    }

    th, td {
      padding: 10px;
      text-align: left;
      border: 1px solid #ddd; /* Light gray border */
    }

    th {
      background-color: #f2f2f2; /* Light gray background for header */
      font-weight: bold;
    }

    tr:nth-child(even) { /* Zebra striping */
      background-color: #f9f9f9;
    }

    tr:hover { /* Highlight row on hover */
      background-color: #e0f2ff; /* Light blue */
    }
</style>
<table width="90%">
<tbody>
<tr>
<td>  



```bash
cd "/mnt/workspace/alibabacloud_acp_learning/ACP/p2_Build LLM Q&A System"
```

</td>
</tr>
</tbody>
</table>  



### 2.2 Deploying Models with vLLM

#### 2.2.1 Deploying Open Source Models

It is recommended to download the **Qwen2.5-1.5B-Instruct** model from either the [ModelScope Model Library](https://modelscope.cn/models) or the [HuggingFace Model Library](https://huggingface.co/models) for deployment purposes. In the following steps, we will use ModelScope as an example.

First, download the model files to your local machine.

```python
# Notebook cell: the leading "!" runs each line as a shell command.
!mkdir -p ./model/qwen2_5-1_5b-instruct
!modelscope download --model qwen/Qwen2.5-1.5B-Instruct --local_dir './model/qwen2_5-1_5b-instruct'
```

After the download succeeds, the model files are saved in the `./model/qwen2_5-1_5b-instruct` folder.
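
Before starting the service, it can be worth checking that the download is complete. The sketch below is one possible sanity check, assuming the usual layout of a Hugging Face-style model folder (a `config.json` plus one or more `*.safetensors` weight files); `check_model_dir` is a hypothetical helper name, not part of ModelScope or vLLM.

```python
from pathlib import Path


def check_model_dir(model_dir):
    """Return a list of problems found in a downloaded model directory."""
    path = Path(model_dir)
    if not path.is_dir():
        return [f"{path} is not a directory"]
    problems = []
    # The model configuration is expected next to the weights.
    if not (path / "config.json").is_file():
        problems.append("config.json is missing")
    # Weights in safetensors format are what --load-format "safetensors" expects.
    if not any(path.glob("*.safetensors")):
        problems.append("no *.safetensors weight files found")
    return problems
```

For example, `check_model_dir("./model/qwen2_5-1_5b-instruct")` returns an empty list when both kinds of files are present.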

<img src="https://img.alicdn.com/imgextra/i3/O1CN01vTzOrP1n0sUaNfdIO_!!6000000005028-2-tps-710-666.png" width="400">  



Next, install vLLM by running the following command in the terminal window. If you encounter version conflicts, you can also try `vllm==0.6.2`.

<table width="90%">
<tbody>
<tr>
<td>   

```bash
pip install vllm==0.6.0
```

</td>
</tr>
</tbody>
</table>

After installing vllm, execute the **vllm command** in the terminal window to start a model service.
<table width="90%">
<tbody>
<tr>
<td>  



```bash
vllm serve "./model/qwen2_5-1_5b-instruct" --load-format "safetensors" --port 8000
```

</td>
</tr>
</tbody>
</table>

- `vllm serve`: Starts the model service.
- `"./model/qwen2_5-1_5b-instruct"`: The path of the model to load. The directory usually contains the model files, version information, etc.
- `--load-format "safetensors"`: Specifies the format used when loading the model weights.
- `--port 8000`: Specifies the port number. If the port is occupied, switch to another one, such as 8100.
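
To find out whether a port is already occupied before you pass it to `--port`, you can try to connect to it. The following is a small sketch using only the Python standard library; `port_in_use` and `first_free_port` are hypothetical helper names, and the candidate ports other than 8000 and 8100 are arbitrary examples.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds, i.e. the port is taken.
        return sock.connect_ex((host, port)) == 0


def first_free_port(candidates=(8000, 8100, 8200)):
    """Return the first candidate port that nothing is listening on."""
    for port in candidates:
        if not port_in_use(port):
            return port
    raise RuntimeError("all candidate ports are occupied")
```

Note that this only tells you the state of the port at the moment of the check; another process could still take it before the service starts.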

After the service starts successfully, the terminal window will display the message **"Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)"**.

<img src="https://img.alicdn.com/imgextra/i2/O1CN01aJBpG11UvEWl0jdOr_!!6000000002579-2-tps-2806-952.png" width=1000>

Please note that closing the terminal window will immediately terminate the service. Subsequent tests and performance evaluations will rely on this service, so do not close this terminal window.
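
Because the later steps depend on this service, it can help to confirm that it is ready before sending test requests, since loading the model takes some time. The sketch below polls the service's OpenAI-compatible `/v1/models` endpoint until it answers. `wait_until_ready` is a hypothetical helper, and the default address assumes the service was started on port 8000 as above.

```python
import http.client
import time
import urllib.request


def wait_until_ready(base_url="http://localhost:8000", timeout=300.0, interval=2.0):
    """Poll GET {base_url}/v1/models until it returns 200 or the timeout expires."""
    # The service is local, so bypass any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with opener.open(f"{base_url}/v1/models", timeout=5) as response:
                if response.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            pass  # not listening yet (or not ready); retry after a short pause
        time.sleep(interval)
    return False
```

Calling `wait_until_ready()` returns `True` as soon as the endpoint responds, or `False` if the service is still unreachable when the timeout expires.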

> If you want the service to run continuously in the background without being affected by closing the terminal window, you can use the following command.
> ```bash
> # Run the service in the background, with the service logs stored in vllm.log
> nohup vllm serve "./model/qwen2_5-1_5b-instruct" --load-format "safetensors" --port 8000 > vllm.log 2>&1 &
> ```  



#### 2.2.2 Deploying the Fine-tuned Model (Optional)

The fine-tuned model from Section 2.7 is stored by default in the **output** directory. The example below deploys the merged model produced after fine-tuning. Open a new terminal window to run the vllm command.

<table width="90%">
<tbody>
<tr>
<td>  



```bash
vllm serve "./output/qwen2_5-1_5b-instruct/v0-202xxxxx-xxxxxx/checkpoint-xxx-merged" --load-format "safetensors" --port 8001
```

</td>
</tr>
</tbody>
</table>

- "./output/qwen2_5-1_5b-instruct/v0-202xxxxx-xxxxxx/checkpoint-xxx-merged": Replace with the actual fine-tuned model path.
- `--port 8001`: Uses a different port from the service started in Section 2.2.1 to avoid a port conflict.  



### 2.3 Test Service Running Status

vLLM supports starting a local server that is compatible with the OpenAI API, meaning it returns results according to the OpenAI API standard.

Send an HTTP request via cURL to test whether the **Qwen2.5-1.5B-Instruct** model service deployed in **2.2.1** can respond normally. If using a fine-tuned model service, ensure that the port number in the request URL is changed from 8000 to 8001.


```bash
%%bash
curl -X POST http://localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
         "model": "./model/qwen2_5-1_5b-instruct",
         "messages": [
             {"role": "system", "content": "You are a helpful assistant."},
             {"role": "user", "content": "Please tell me how many gold medals the Chinese team won in total at the 2008 Beijing Olympics?"}
         ]
     }'
```


If the endpoint returns a normal response, the service is running properly.
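
The same request can also be sent from Python. The following is a minimal sketch using only the standard library, so it does not depend on any client SDK. `build_chat_payload` and `chat` are hypothetical helper names; the defaults assume the service from Section 2.2.1 on port 8000, and the reply is read from `choices[0].message.content`, which is where the OpenAI chat completion format places the generated text.

```python
import json
import urllib.request


def build_chat_payload(model, question, system_prompt="You are a helpful assistant."):
    """Build the JSON body expected by POST /v1/chat/completions."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    }


def chat(question, model="./model/qwen2_5-1_5b-instruct", base_url="http://localhost:8000"):
    """Send one question to the local service and return the generated text."""
    payload = json.dumps(build_chat_payload(model, question)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # The service is local, so bypass any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

For the fine-tuned model service, you would pass `base_url="http://localhost:8001"` and set `model` to the path used in the corresponding `vllm serve` command.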

vLLM also provides the `/v1/models` endpoint, which lists the deployed models. For more information, see [vLLM-compatible OpenAI API](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#api-reference).  



```bash
%%bash
curl -X GET http://localhost:8000/v1/models
```
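
The response of `/v1/models` follows the OpenAI list format: a JSON object whose `data` array holds one entry per model, each with an `id`. A short sketch for extracting those IDs in Python is shown below; `extract_model_ids` and `list_models` are hypothetical helper names, and the default address assumes the service on port 8000.

```python
import json
import urllib.request


def extract_model_ids(models_response):
    """Return the model IDs from a parsed /v1/models response."""
    return [entry["id"] for entry in models_response.get("data", [])]


def list_models(base_url="http://localhost:8000"):
    """Query the local service and return the IDs of the models it serves."""
    # The service is local, so bypass any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(f"{base_url}/v1/models", timeout=10) as response:
        return extract_model_ids(json.load(response))
```

The returned ID is the value to put in the `model` field of chat requests. With vLLM it defaults to the path passed to `vllm serve`, which is why the requests in this section use `./model/qwen2_5-1_5b-instruct` as the model name.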

### 2.4 Evaluate Service Performance

To evaluate the performance of the deployed model service, this section uses **wrk**, a simple HTTP benchmarking tool, to quickly simulate stress test requests and generate a report. The **POST /v1/chat/completions** endpoint serves as the example for demonstrating the service's performance metrics.

First, open a new terminal window and install wrk. Note: The terminal window should be in the course directory set up in Section 2.1.  



```bash
sudo apt update
sudo apt install wrk
```

Next, prepare the request body for the POST request. It is defined in the `./resources/2_9/post.lua` file, whose content is shown below.  



```lua
wrk.method = "POST"
wrk.headers["Content-Type"] = "application/json"
wrk.body = [[
    {
       "model": "./model/qwen2_5-1_5b-instruct",
       "messages": [
           {"role": "system", "content": "You are a helpful assistant."},
           {"role": "user", "content": "Please tell me how many gold medals the Chinese team won in total at the 2008 Beijing Olympics?"}
       ]
   }
]]
```

Then, run the wrk stress test commands in the terminal window. Set the number of concurrent connections (`-c`) to 1 and then to 10, with a test duration (`-d`) of 10 seconds in both cases, and compare the results of the two runs.



```bash
wrk -t1 -c1 -d10s -s ./resources/2_9/post.lua http://localhost:8000/v1/chat/completions

wrk -t1 -c10 -d10s -s ./resources/2_9/post.lua http://localhost:8000/v1/chat/completions
```
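
To make it clearer what the concurrency setting changes, the following sketch reproduces the idea of a stress test in Python: a pool of worker threads sends requests in parallel and the script derives QPS and average latency from the timings. It is an illustration only and not a replacement for wrk. `run_load` is a hypothetical helper, `send_request` stands for any function that performs one request, and unlike wrk (which runs for a fixed duration) this sketch sends a fixed number of requests.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def run_load(send_request, concurrency, total_requests):
    """Send total_requests calls using `concurrency` parallel worker threads."""

    def timed_call(_):
        start = time.perf_counter()
        send_request()
        return time.perf_counter() - start  # latency of this single request

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(total_requests)))
    wall_time = time.perf_counter() - wall_start

    return {
        "requests": total_requests,
        "qps": total_requests / wall_time,         # completed requests per second
        "avg_latency_ms": 1000 * mean(latencies),  # mean time per request
    }
```

Running `run_load` once with `concurrency=1` and once with `concurrency=10` against the same request function gives a pair of results that can be compared in the same way as the two wrk runs.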

The wrk stress test results are shown below:

<img src="https://img.alicdn.com/imgextra/i3/O1CN01ybO7TU1X6LJ12FYdV_!!6000000002874-2-tps-1452-322.png" width="500" height="150">
<img src="https://img.alicdn.com/imgextra/i2/O1CN01bberC61txr86CpFjU_!!6000000005969-2-tps-1472-362.png" width="500" height="150">

According to the stress test results, as concurrency increased from 1 to 10, QPS rose roughly sixfold (3.30 -> 20.08), while average latency increased by about 30% (324.61 ms -> 426.84 ms). Notably, the second test recorded 2 timeout errors. The likely cause is that, under higher concurrency, the load exceeded the server's processing capacity, so some requests timed out.
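
The ratios quoted above can be recomputed directly from the four measured values. The short calculation below uses only the numbers from the two wrk reports; the comparison against an ideal 10x speed-up is an added illustration of how far the service is from scaling linearly with concurrency.

```python
def relative_change(before, after):
    """Fractional change from `before` to `after` (0.25 means +25%)."""
    return (after - before) / before


# Values taken from the two wrk reports (concurrency 1 and concurrency 10).
qps_c1, qps_c10 = 3.30, 20.08
latency_c1_ms, latency_c10_ms = 324.61, 426.84

qps_ratio = qps_c10 / qps_c1
latency_increase = relative_change(latency_c1_ms, latency_c10_ms)

print(f"QPS ratio: {qps_ratio:.2f}x")                           # QPS ratio: 6.08x
print(f"Average latency increase: {latency_increase:.1%}")      # Average latency increase: 31.5%
print(f"Share of ideal 10x scaling reached: {qps_ratio / 10:.1%}")  # 60.8%
```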



## ☁ 3. Deploying Models in the Cloud

The above stress test results show that, due to the limited computing power of the device where the model is deployed, the model service cannot meet the inference requirements for **low latency** and **high concurrency**.

The traditional solution is to purchase higher-performance servers and redeploy the model on them. However, this approach has the following issues:
- **Resource cost**: Requires a large one-time purchase of high-performance servers.

- **Operational cost**: Daily maintenance of servers, including monitoring, upgrading, and troubleshooting, requires advanced technical skills.

- **Reliability**: The stability and reliability of the service depend on both the capabilities of the maintenance personnel and the project budget. With a limited budget, it is difficult to build a highly available and reliable model service.

- **Low flexibility**: Limited by fixed hardware resources, you cannot adjust resources dynamically according to actual demand, which may lead to either insufficient performance or wasted resources.

<br>

Compared with purchasing servers, **deploying models on cloud services** is often a better choice because it offers more flexible deployment options. Based on your capabilities and needs, you can choose from [**Model Studio**](https://help.aliyun.com/zh/model-studio/getting-started/what-is-model-studio), [**Function Compute FC**](https://help.aliyun.com/zh/functioncompute/fc-3-0/product-overview/what-is-function-compute), [**AI Platform PAI-EAS**](https://help.aliyun.com/zh/pai/user-guide/overview-2), [**Elastic GPU Service**](https://help.aliyun.com/zh/egs/what-is-elastic-gpu-service), [**Container Service ACK**](https://help.aliyun.com/zh/ack/product-overview/product-introduction), [**Container Compute Service ACS**](https://help.aliyun.com/zh/cs/product-overview/product-introduction), and other cloud services. These services help you build model services that are scalable, highly concurrent, low-latency, easy to manage, and stable, and that adapt quickly to business changes.



### 3.1 Deploying Models Using Alibaba Cloud Model Studio

You can deploy models quickly from the Alibaba Cloud Model Studio console. This method is simple: you get a dedicated model service without having to master complex deployment techniques. You can also deploy models [through the API](https://help.aliyun.com/en/model-studio/developer-reference/model-deployment-quick-start).

The deployment process is as follows:

<img src="https://img.alicdn.com/imgextra/i3/O1CN01KkOfZb1HAtQ9sTyop_!!6000000000718-0-tps-2630-278.jpg" width="450">

- **Select Model**: Choose a pre-configured model or a custom model.
    - Pre-configured Model: Standard models supported by Alibaba Cloud Model Studio. Select the appropriate model for deployment based on your needs. Supported models can be viewed when [deploying a new model](https://bailian.console.aliyun.com/?spm=a2c4g.11186623.0.0.63e56cfcXIU4Qj#/efm/model_deploy).

        <img src="https://help-static-aliyun-doc.aliyuncs.com/assets/img/en-US/9723494371/p892458.png" width="300">

    - Custom Model: Models optimized by the Alibaba Cloud Model Studio platform. Refer to: [Optimization-supported models](https://help.aliyun.com/en/model-studio/model-training-on-console?spm=a2c4g.11186623.0.0.63e56cfcMC90g9#a6da1accf0dun).
- **One-click Model Deployment**: The console supports one-click model deployment, and models can also be quickly deployed via API.
- **Using Models Based on Alibaba Cloud Model Studio Ecosystem**: Deployed models can be seamlessly integrated into the Alibaba Cloud Model Studio ecosystem, supporting direct use in the Alibaba Cloud Model Studio console and reuse of Alibaba Cloud Model Studio APIs through HTTP and DashScope.

For detailed steps, see the [Alibaba Cloud Model Studio Model Deployment](https://help.aliyun.com/en/model-studio/user-guide/model-deployment) documentation.

Deploying through Alibaba Cloud Model Studio greatly reduces the difficulty of model deployment and maintenance, but it supports only a limited set of model types. If your model is not among them, you can deploy it using the following methods.



### 3.2 Deploying Models Using Function Compute (FC)

Function Compute (FC) supports a wider range of model types. FC provides serverless GPU services, so you do not need to manage the underlying resources. It scales automatically within seconds and is billed on a pay-as-you-go basis, which can significantly reduce costs for infrequently used models. It is especially suitable for temporary tasks that need a lot of computing resources.

However, deploying models using Function Compute is not without its drawbacks:
- Cold start latency: If no requests are received for a period of time, the function may enter a "cold" state. When a new invocation request is received, the instance needs to be restarted, which may result in longer initial response times.
- Increased debugging difficulty: Applications based on functions may be harder to debug and monitor. It can be challenging to pinpoint issues in multi-step processing workflows.

In summary, deploying models using Function Compute (FC) is highly suitable for lightweight inference tasks and low-frequency access scenarios with less stringent real-time requirements (e.g., offline batch processing, scheduled or event-triggered tasks).

However, if your scenario has **high real-time** requirements, or requires stronger monitoring and debugging for complex model inference, you can try the following deployment methods.

**Deployment Reference**: You can [one-click deploy the QwQ-32B inference model](https://help.aliyun.com/zh/functioncompute/fc-3-0/use-cases/two-ways-to-quickly-deploy-and-experience-qwq-32b-reasoning-model) to experience the deployment capabilities provided by Function Compute. For more deployment practices, see the [Function Compute 3.0 - Practical Tutorials](https://help.aliyun.com/zh/functioncompute/fc-3-0/use-cases/?spm=a2c4g.11186623.help-menu-2508973.d_3.228e493fj6un1Y&scm=20140722.H_2509019._.OR_help-V_1).  



### 3.3 Deploying Models Using PAI-EAS

You can also deploy models downloaded from open-source communities or trained by yourself as online services through Elastic Algorithm Service (EAS), the online model service of the Artificial Intelligence Platform PAI. EAS provides features such as elastic scaling, blue-green deployment, resource group management, version control, and resource monitoring to help you manage model applications.

EAS is highly suitable for **real-time synchronous inference** scenarios. To address the issue of long initial request latency, EAS offers a model warm-up feature, which pre-warms the model service before going live, enabling the model service to enter a normal service state immediately after deployment.

Compared to Function Compute, PAI-EAS may have higher fixed costs. For low-frequency usage scenarios, it might not be as cost-effective as Function Compute FC. You can try using the Spot Instance mode to help save costs. For more details, see: [PAI-EAS Spot Best Practices](https://help.aliyun.com/zh/pai/use-cases/pai-eas-spot-best-practices).

**Deployment Reference**: To quickly deploy a general-purpose model and try the model service capabilities of PAI-EAS, see [Deploy Large Language Model Applications with EAS in 5 Minutes](https://help.aliyun.com/zh/pai/use-cases/use-pai-eas-to-quickly-deploy-tongyi-qianwen?spm=a2c4g.11186623.0.i0#ba6b53303bb66). To deploy a custom model, see [How to Mount Custom Models?](https://help.aliyun.com/zh/pai/use-cases/deploy-llm-in-eas#c1d769ba33kh5).  



### 3.4 Deploying Models Using Elastic Compute Service or Container Services

Deploying models on Elastic Compute Service (ECS), typically on GPU-accelerated instances, is a common approach that gives you complete control over server configurations, operating systems, and environment settings. This is particularly useful for models that require heavy customization or specific dependencies.

ECS also provides stable computing resources without the cold start latency associated with Function Compute. ECS can be combined with Auto Scaling (ESS) to scale instances elastically, and with load balancers (such as SLB) to ensure high availability and load balancing. Security groups, access control, and data encryption help protect data and services.

However, configuring and managing these features requires skills and experience, which leads to higher maintenance costs.
- **Suitable Scenarios**: LLMs requiring high levels of customization, stable performance, and long-term operation; enterprises with high requirements for cost predictability and resource control.
- **Unsuitable Scenarios**: Small projects requiring rapid deployment and elastic scaling; teams sensitive to operational complexity and with limited resources.

**Deployment Reference**: For specific steps, see [Using vLLM Container Image to Quickly Build a Large Language Model Inference Environment on GPU](https://help.aliyun.com/zh/egs/use-cases/use-a-vllm-container-image-to-run-inference-tasks-on-a-gpu-accelerated-instance). If the model you are deploying is Llama, ChatGLM, Baichuan, Qwen-Max, or a fine-tuned version of one of them, it is recommended to [Install and Use DeepGPU-LLM for Model Inference](https://help.aliyun.com/zh/egs/developer-reference/install-and-use-deepgpu-llm-for-model-inference?spm=a2c4g.11186623.0.i6) to accelerate inference.



If your organization has already accumulated experience in model deployment based on containers, you can also use ACK combined with GPU cloud server nodes without learning too many new concepts.

Additionally, you can consider using ACS, which helps you directly access GPU-powered containers within the familiar Kubernetes container cluster environment, while eliminating the need to focus on cluster operations and maintenance.

**Deployment Reference**:    
- ACK: [Deploy DeepSeek Distillation Model Inference Service Based on ACK](https://help.aliyun.com/zh/ack/cloud-native-ai-suite/use-cases/deploy-deepseek-distillation-model-inference-service-based-on-ack)     
- ACS: [Build QwQ-32B Model Inference Service Using ACS GPU Computing Power](https://help.aliyun.com/zh/cs/user-guide/build-qwq-32b-model-inference-service-using-acs-gpu-computing-power)  



### 3.5 Cloud Service Solution Comparison and Decision Recommendations

When deploying models on Alibaba Cloud, choosing different services requires comprehensive consideration of factors such as **business requirements**, **model characteristics**, **technical capabilities**, **operations complexity**, and **costs**. The following is a comparative analysis of several common cloud service deployment methods and recommendations for selection:

| Service Name | Features | Applicable Scenarios |
| --- | --- | --- |
| Model Studio | A dedicated platform for LLMs, providing one-click deployment, model optimization, API call management, and encapsulating underlying complexities. | Rapid deployment of LLMs (e.g., Qwen series), without the need to focus on infrastructure. |
| Function Compute (FC) | Serverless architecture, billed by request volume, with automatic scaling and no operations required. | Suitable for lightweight inference tasks and low-frequency access scenarios (e.g., scheduled tasks, event triggers). |
| PAI-EAS | An online model serving platform that supports custom model deployment, elastic scaling, monitoring, and other capabilities. | Medium and small deep learning models (e.g., image classification, NLP), requiring elastic scaling and fine-grained resource management. |
| Elastic GPU Service | IaaS-level resources, flexible installation of any framework and dependencies, with manual operations and maintenance required. | Custom model training/inference, requiring full control over the environment (e.g., complex dependencies, special hardware needs). |
| Container Service ACK/Container Compute Service ACS | Kubernetes cluster deployment, integrating CI/CD, automatic scaling, load balancing. | Complex microservice architectures, mixed workloads, large-scale distributed inference or training. |

Model Deployment Service Selection Recommendations:
1. Core Requirements
    - Rapid deployment of LLMs → Model Studio (e.g., conversational bots, generative AI).
    - Low-cost lightweight services/low-frequency non-real-time tasks → Function Compute FC (e.g., hundreds of queries per day for small tools).
    - Conventional model deployment (image/text/NLP) → PAI-EAS (balancing performance and usability).
    - Custom environments or complex dependencies → Elastic GPU Service or ACK.

2. Model Compatibility
    - For Tongyi series models, prioritize: Model Studio.
    - General-purpose models: Function Compute FC, PAI-EAS, Elastic GPU Service (supporting TensorFlow/PyTorch/ONNX ecosystems), containerized deployment (ACK/ACS).

3. Operations Complexity and Team Technical Capabilities
    - No operations: Non-technical teams → Model Studio (visual operation).
    - Low operations complexity: Algorithm engineers → PAI-EAS, development teams → Function Compute FC.
    - High operations complexity: Mature DevOps teams → ACK (requires maintaining complex pipelines) or Elastic GPU Service (requires managing the environment manually).

4. Cost Control
    - Low-cost lightweight scenarios: Function Compute FC (billed by request count and resource consumption, no idle costs).
    - Moderate cost: PAI-EAS (billed by instance specifications and duration, suitable for stable traffic, can be optimized through scaling).
    - High cost but flexible: Elastic GPU Service (pay-as-you-go/annual/monthly subscription, requires optimizing resource utilization manually).
    - Higher overall cost: ACK (involves cluster management fees and resource scheduling complexity).  
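
The recommendations above can be condensed into a small decision function. The sketch below is a deliberate simplification for illustration: `recommend_service` and its four yes/no parameters are hypothetical, and a real decision would also weigh cost, latency targets, and team skills as described in the list.

```python
def recommend_service(
    supported_by_model_studio,
    needs_custom_environment=False,
    low_frequency=False,
    has_kubernetes_team=False,
):
    """Suggest a deployment service from a few yes/no answers (simplified)."""
    # Custom environments or complex dependencies need full control of the stack.
    if needs_custom_environment:
        return "ACK/ACS" if has_kubernetes_team else "Elastic GPU Service"
    # Models that Model Studio supports (e.g. Tongyi series) deploy fastest there.
    if supported_by_model_studio:
        return "Model Studio"
    # Low-frequency, non-real-time workloads benefit from pay-per-use billing.
    if low_frequency:
        return "Function Compute FC"
    # Otherwise default to the general-purpose online serving platform.
    return "PAI-EAS"
```

For example, a fine-tuned model that Model Studio does not support and that receives only a few hundred queries per day maps to `recommend_service(False, low_frequency=True)`, which returns `"Function Compute FC"`.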



## ✅ Summary

This lesson introduced the basic methods of model deployment. You have learned:

- How to deploy a model in practice, that is, how to turn a model into an accessible inference service. The deployed model can be an open-source model or a fine-tuned model.

- That deploying a model is not mandatory. You can directly call fully managed API services from providers such as Alibaba Cloud, which reduces initial investment costs and avoids wasting idle GPU resources.

- How to choose among cloud services (such as Model Studio, Function Compute FC, PAI-EAS, Elastic Compute Service, and ACK/ACS) based on your needs and capabilities, balancing business requirements against resource utilization.

With these basics in place, you have a solid foundation for building high-performance and scalable LLM applications.

Next, you will learn how to better ensure the availability, security, and performance of models when running LLM applications in production.

>⚠️ **Note**: After completing this section, please stop the current PAI-DSW GPU instance in time to avoid additional charges.

### Further Reading

This lesson covered cloud deployment. In practice, cloud deployment can be divided into public cloud deployment and private cloud deployment.

- **Public Cloud Deployment**: Encapsulate the model as an API for users to call, similar to the SaaS model. This method lowers the barrier to use and simplifies integration, but you must ensure the stability and security of the API.

- **Private Cloud Deployment**: Set up a private cloud platform within the enterprise and deploy the model on it. This provides stronger data security and control and supports customization, but it incurs maintenance costs.

In addition, edge-cloud collaborative deployment is also a common deployment method.

**Edge-cloud Collaborative Deployment** combines the advantages of cloud and edge deployment. Some calculations are performed on edge devices, while complex computing tasks are uploaded to the cloud for processing, enabling the handling of complex computing tasks while ensuring user experience.

Edge-cloud collaborative deployment suits scenarios that require real-time responses but have limited on-device computing resources. One example, shown in the figure below, is the device-side intelligent companion voice robot that Rakuten built in collaboration with the Tongyi LLM. Here, the "edge" side refers to small models deployed on the client device, and the "cloud" side refers to LLMs deployed in the cloud. The small models must be fine-tuned to fit different client devices, run in the same environment as the client application, and handle preliminary data preprocessing and other simple tasks. The processed information is then sent to the cloud, where the LLM performs in-depth processing so that the system can respond quickly and efficiently to the user on the device.

<img src="https://img.alicdn.com/imgextra/i2/O1CN015VBCB524nk9sYPd8s_!!6000000007436-0-tps-1662-474.jpg" width="650">

In some specific scenarios, **embedded system deployment** is a more appropriate choice, such as in cars, robots, and medical devices. This method deploys the model on a hardware platform, enabling real-time control and decision-making, but requires higher optimization of both the model and the hardware.

Therefore, when facing actual business needs, you need to consider the system's performance requirements, data privacy and security, and the complexity of implementation to ensure the efficiency and sustainability of the deployment solution.

## ✉️ Evaluation and Feedback
Thank you for studying the Alibaba Cloud LLM ACP Certification course. If you think there are parts of the course that are well-written or need improvement, we look forward to your [evaluation and feedback through this questionnaire](https://survey.aliyun.com/apps/zhiliao/Mo5O9vuie).

Your criticism and encouragement are both our motivation to move forward.  

