# 2.7 Enhancing Model Capabilities through Fine-tuning  



## 🚄 Preface

In the previous lessons, we built a Q&A bot and tried to enhance its capabilities by optimizing prompts, building a RAG chatbot, and extending it with plugins. However, you may have noticed that you have only been "patching" around the model: these methods improve performance through external tools, while the model's inherent knowledge boundaries and reasoning abilities remain fundamentally unchanged. This section takes you into the "internal training ground" of large language models (LLMs), where fine-tuning directly improves the model's underlying capabilities.

When you face in-depth needs in a specific domain, such as precisely parsing elementary school math problems, prompt engineering and RAG often fall short. For details like operator precedence rules or unit conversion logic in word problems, the model needs to build a structured knowledge system. This is where fine-tuning shows its unique advantage: by "targeted feeding" the model with math problem-solving examples generated by DeepSeek-R1, you can enable it to learn DeepSeek-R1's mathematical knowledge, grasp mathematical thinking paradigms, and even discover problem-solving patterns on its own.



## 🍁 Course Objectives

After completing this course, you will be able to:

* Learn and understand the core principles and implementation logic of fine-tuning large language models (LLMs).
* Combine training principles to master the methodology for optimizing key training parameters.
* Independently complete the fine-tuning of models, learn about potential issues that may arise, and practice various solutions.



## 0. Environment Preparation

Since fine-tuning models requires high hardware performance, it is recommended to use Platform for AI's Data Science Workshop to create an instance equipped with a GPU, allowing you to complete the fine-tuning tasks more efficiently.

> If you do not have a local GPU environment, or your GPU has less than 24 GB of memory, running this course locally is not recommended because the code may fail to execute.

Please refer to `Step 1: Create a PAI DSW Instance` in "[1_0_Setup_Computing_Environment](https://edu.aliyun.com/course/3130200/lesson/343310285)" to create a new instance, with the following settings:

1. Ensure that the new instance has a **different name** from any previously created instance, such as: acp_gpu  
2. For **resource specifications**, select `ecs.gn7i-c8g1.2xlarge` (this specification includes **one A10 GPU with 24 GB of GPU memory**).
<img src="https://img.alicdn.com/imgextra/i4/O1CN01L3iYeb1MRuEvXhhcD_!!6000000001432-2-tps-2984-1582.png" width="800">  
3. For the **image**, choose `modelscope:1.21.0-pytorch2.4.0-gpu-py310-cu124-ubuntu22.04` (you need to switch the "Image Configuration" -> "Chip Type" to GPU).

After the instance is successfully created and its status is `Running`, enter the following command in the `Terminal` to obtain the ACP course code:

```bash
git clone https://github.com/AlibabaCloudDocs/aliyun_acp_learning.git
```

Reopen this chapter in the `Notebook` of the newly created GPU instance and continue learning the subsequent content.<br>

Install the following dependencies:


In [1]:
# Install the dependencies used in this chapter.
# The package with extras is quoted so that shells such as zsh do not treat the square brackets as a glob pattern.
%pip install accelerate==1.0.1 rouge-score==0.1.2 nltk==3.9.1 "ms-swift[llm]==2.4.2.post2" evalscope==0.5.5rc1

Note: you may need to restart the kernel to use updated packages.


## 1. Task Design

How to solve mathematical problems has always been an important direction in the development of large language models (LLMs), and it just so happens that your intelligent assistant also needs to have basic computational capabilities. To facilitate fine-tuning of the model, you can select a small-parameter open-source model `qwen2.5-1.5b-instruct` as your base model.

First, you need to download the model and load it into memory:

In [2]:
# Download model parameters to the ./model directory
!mkdir -p ./model
!modelscope download --model qwen/Qwen2.5-1.5B-Instruct --local_dir './model'

from swift.llm import (
    get_model_tokenizer, get_template, ModelType,
    get_default_template_type
)
import torch

# You can modify the query (model input) according to your needs

# Obtain model information
model_type = ModelType.qwen2_5_1_5b_instruct
template_type = get_default_template_type(model_type)
# Set the local model location
model_id_or_path = "./model"
# Initialize the model and input/output formatting template
kwargs = {}
model, tokenizer = get_model_tokenizer(model_type, torch.float32, model_id_or_path=model_id_or_path, model_kwargs={'device_map': 'cpu'}, **kwargs)
model.generation_config.max_new_tokens = 128
template = get_template(template_type, tokenizer, default_system='')
print("Model initialization completed")

Downloading [config.json]: 100%|█████████████████| 660/660 [00:02<00:00, 224B/s]
Downloading [configuration.json]: 100%|███████| 2.00/2.00 [00:01<00:00, 1.18B/s]
Downloading [generation_config.json]: 100%|██████| 242/242 [00:01<00:00, 209B/s]
Downloading [LICENSE]: 100%|███████████████| 11.1k/11.1k [00:01<00:00, 5.90kB/s]
Downloading [merges.txt]: 100%|████████████| 1.59M/1.59M [01:12<00:00, 23.2kB/s]
Downloading [model.safetensors]: 3.09GB [45:28, 1.22MB/s]                       
Downloading [README.md]: 100%|█████████████| 4.80k/4.80k [00:02<00:00, 1.70kB/s]
Downloading [tokenizer.json]: 100%|████████| 6.71M/6.71M [00:06<00:00, 1.15MB/s]
Downloading [tokenizer_config.json]: 100%|█| 7.13k/7.13k [00:01<00:00, 5.51kB/s]
Downloading [vocab.json]: 100%|█████████████| 2.65M/2.65M [00:05<00:00, 495kB/s]


  from .autonotebook import tqdm as notebook_tqdm
[INFO:swift] Successfully registered `/usr/local/lib/python3.10/site-packages/swift/llm/data/dataset_info.json`
2025-07-15 12:12:31,882	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.
[INFO:swift] Loading the model using model_dir: ./model
[INFO:swift] model_kwargs: {'device_map': 'cpu'}
[INFO:swift] model.max_model_len: 32768


Model initialization completed


You can directly try its effect on math problems (the answer is: 648 kilograms of radishes can be harvested):  



In [None]:
from swift.llm import inference
from IPython.display import Latex, display

math_question = "In a triangular vegetable field with a base of 18 meters and a height of 6 meters, radishes are planted. If 12 kilograms of radishes are harvested per square meter, how many kilograms of radishes can be harvested from this field?"
query = math_question
response, _ = inference(model, template, query)
print(query)
print("The correct answer is: 648 kilograms of radishes can be harvested")
print('-----------LLM response-------------')
display(Latex(response))
print('------------End of response--------------')

It can be observed that your model does not seem to be able to accurately compute this simple mathematical problem. The model knows the formula for the area of a triangle but fails to use this knowledge to accurately calculate the weight of the radish.

Of course, the effect of using RAG is the same. From previous learning, you know that RAG is more like an open-book exam. However, you have never seen an open-book math exam improve scores because the core of improving math ability lies in enhancing students' logical reasoning and computational skills rather than knowledge retrieval.

Therefore, to directly enhance the ability of your Q&A bot on simple mathematical problems, you must use model fine-tuning to improve the model’s logical reasoning ability. (Computational ability can be enhanced by introducing a "calculator" plugin.)  



## 2. Fine-tuning Principles

### 2.1 How Models Learn

#### 2.1.1 Machine Learning - Finding Patterns Through Data

In traditional programming work, you usually know the explicit rules and write these rules into functions, such as: $f(x) = ax$.

Here, a is a known deterministic value (also called a parameter or weight). This function represents a simple algorithmic model that can compute (predict) the output $y$ based on the input $x$.

However, in real-world scenarios, it's more likely that you don't know the explicit rules (parameters) beforehand but may have some observed phenomena (data).

The goal of machine learning is to help you use this data (training set) to try and find (learn) these parameter values, a process known as training the model.

#### 2.1.2 Loss Function & Cost Function - Quantifying Model Performance

To find the most suitable parameters, you need a way to measure whether the currently tested parameters are appropriate.

For better understanding, assume you now need to evaluate whether the parameter a in the model $f(x) = ax$ is suitable.

##### Loss Function

You can assess the model's performance on a single data point $x_i, y_i$ by subtracting the predicted result $f(x_i)$ from the actual result $y_i$ for each sample $x_i$ in the training set. The function used to evaluate this error is called the loss function (or error function): $L(y_i, f(x_i)) = y_i - ax_i$.

Directly calculating the difference might yield positive or negative values, which could cancel each other out when aggregating losses, underestimating the total loss. To address this issue, you can consider squaring the difference as the loss: $L(y_i, f(x_i)) = (y_i - ax_i)^2$. Additionally, squaring amplifies the impact of errors, helping you identify the most suitable model parameters.

> In practical applications, different models may use different calculation methods as the loss function.

##### Cost Function

To evaluate the model's overall performance across the entire training set, you can calculate the average loss of all samples (i.e., the mean squared error). The function used to assess the model's overall performance across all training samples is called the Cost Function.

For a training set with m samples, the cost function can be expressed as: $J(a) = \frac{1}{m} \sum_{i=1}^{m} (y_i - ax_i)^2$.

> In practical applications, different models may also choose different calculation methods as the Cost Function.

With the Cost Function, the task of finding suitable model parameters becomes the task of finding the minimum of the Cost Function (i.e., the optimal solution): the value of $a$ at which the Cost Function reaches its minimum is the most suitable model parameter.
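
The loss and cost definitions above can be tried out on a tiny example. The sketch below is only an illustration: the four data points and the list of candidate values for $a$ are made up, and the helper names `squared_loss` and `cost` are not part of any library.

```python
# Toy training set generated by the rule y = 2x (so the "true" parameter is a = 2).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def squared_loss(a, x, y):
    """Loss on a single sample: (y - a*x)^2."""
    return (y - a * x) ** 2


def cost(a, xs, ys):
    """Cost over the whole training set: mean of the per-sample squared losses."""
    return sum(squared_loss(a, x, y) for x, y in zip(xs, ys)) / len(xs)


# Evaluate a handful of candidate values and keep the one with the lowest cost.
candidates = [0.0, 1.0, 1.5, 2.0, 2.5, 3.0]
costs = {a: cost(a, xs, ys) for a in candidates}
best_a = min(costs, key=costs.get)

for a, j in costs.items():
    print(f"a = {a:.1f}  ->  J(a) = {j:.2f}")
print("best candidate:", best_a)
```

Trying candidates one by one only works for a single parameter and a short list of guesses, which is why an automated search method is needed for real models.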

If you plot the Cost Function, the task of finding the optimal solution essentially involves finding the lowest point on the curve or surface.
<div style="text-align: center">
<img src="https://img.alicdn.com/imgextra/i4/O1CN0149XTTS1WUKSTtpeoh_!!6000000002791-2-tps-2314-1682.png" style="width: 400px; display: block; margin-left: auto; margin-right: auto"/>
</div>

> In real projects, people often use the terms cost function and loss function interchangeably. In subsequent content and code, we follow this engineering convention and refer to the cost function as the loss function.

#### 2.1.3 Gradient Descent Algorithm - Automatically Finding the Optimal Solution

In the previous curve, you can visually identify the lowest point. However, in practical applications, models typically have many parameters, and their Cost Functions are often complex surfaces in high-dimensional spaces, making it impossible to find the optimal solution through direct observation. Therefore, you need an automated method to find the optimal parameter configuration.

Gradient descent is one of the most common methods. A typical implementation of gradient descent starts by randomly selecting a starting point on the surface (or curve), then continuously making small adjustments to the parameters until the lowest point (corresponding to the optimal parameter configuration) is found.

<div style="text-align: center">
<img src="https://img.alicdn.com/imgextra/i2/O1CN01ihhR9Y1IbkFZTQ3bV_!!6000000000912-1-tps-1080-810.gif" style="width: 400px;margin-left: auto; margin-right: auto"/>
<img src="https://img.alicdn.com/imgextra/i3/O1CN01meUISA1dHgq2mqm6V_!!6000000003711-1-tps-1080-810.gif" style="width: 400px;margin-left: auto; margin-right: auto"/>
</div>

When training a model, you need the training program to automatically adjust the parameters so that the value of the Cost Function approaches the lowest point. Therefore, the gradient descent algorithm must automatically control two aspects: the direction of parameter adjustment and the magnitude of parameter adjustment.

##### Direction of Parameter Adjustment

If the Cost Function is a U-shaped curve, you can intuitively see that the parameter adjustment should move in the direction where the absolute value of the slope decreases, i.e., towards a flatter area.
<div style="text-align: center">
<img src="https://img.alicdn.com/imgextra/i2/O1CN01ME3u6G203FVsQsmLe_!!6000000006793-2-tps-1608-1244.png" style="width: 400px; display: block; margin-left: auto; margin-right: auto"/>
</div>

If the Cost Function is a surface in a three-dimensional coordinate system, the direction of parameter adjustment should similarly move towards flatter areas. However, at a given point on the surface there are multiple possible descending directions. To find the lowest point as quickly as possible, you should move in the direction of steepest descent.
<div style="text-align: center">
<img src="https://img.alicdn.com/imgextra/i2/O1CN01Uh8OxI1mqnkBHqMjH_!!6000000005006-1-tps-664-684.gif" style="width: 400px; display: block; margin-left: auto; margin-right: auto"/>
</div>

In mathematics, the gradient points in the direction of the steepest ascent from a point on the surface, and its opposite direction is the steepest descent.

To find the lowest point on the surface in the shortest time, the direction of parameter adjustment should be along the opposite direction of the gradient, i.e., the green arrow direction in the two figures above.
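
As a rough illustration of this idea, the sketch below approximates the gradient of a made-up two-parameter surface $f(a, b) = a^2 + 3b^2$ by measuring the slope along each axis, then compares a small step against the gradient with a small step along it. The surface, the starting point, and the step size are all assumptions chosen for the example.

```python
def f(a, b):
    """Hypothetical bowl-shaped cost surface with its lowest point at (0, 0)."""
    return a ** 2 + 3 * b ** 2


def numerical_gradient(func, a, b, eps=1e-6):
    """Approximate (slope along a, slope along b) with central differences."""
    slope_a = (func(a + eps, b) - func(a - eps, b)) / (2 * eps)
    slope_b = (func(a, b + eps) - func(a, b - eps)) / (2 * eps)
    return slope_a, slope_b


a, b = 2.0, 1.0
grad_a, grad_b = numerical_gradient(f, a, b)

step = 0.1
downhill = f(a - step * grad_a, b - step * grad_b)  # move against the gradient
uphill = f(a + step * grad_a, b + step * grad_b)    # move along the gradient

print("gradient:", (round(grad_a, 3), round(grad_b, 3)))
print("cost here:", f(a, b))
print("cost after a step against the gradient:", round(downhill, 3))
print("cost after a step along the gradient:", round(uphill, 3))
```

Real training frameworks compute gradients analytically (automatic differentiation) rather than by finite differences; the numerical version is used here only because it mirrors the "slope in each axis direction" description.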

> For a curve f(a) in a two-dimensional coordinate system, the gradient at a point is the slope at that point. 
> For a surface f(a,b) in a three-dimensional coordinate system, the gradient at a point is a two-dimensional vector composed of the slope values in the a and b axis directions. This indicates the rate of change of the function in each input variable direction and points in the direction of the fastest growth. Calculating the slope of a point on the surface in a particular axis direction is also referred to as taking the partial derivative.

##### Magnitude of Parameter Adjustment

After determining the direction of parameter adjustment, the magnitude of the adjustment needs to be determined.

Adjusting parameters with a fixed step size is the easiest approach, but this may prevent you from ever finding the lowest point, causing oscillation near the lowest point instead.

For example, in the figure below, adjusting parameters with a fixed step size of 1.5 results in oscillation around the lowest value, unable to further approach the lowest point.

<div style="text-align: center">
<img src="https://img.alicdn.com/imgextra/i3/O1CN01y7FatQ27bKI9CYCJ1_!!6000000007815-1-tps-938-646.gif" style="width: 400px; display: block; margin-left: auto; margin-right: auto"/>
</div>

To avoid this issue, the adjustment magnitude should be reduced as you approach the lowest point. The closer you get to the lowest point, the smaller the slope becomes. Therefore, instead of using a fixed step size, you can use the slope at the current position as the adjustment magnitude.

<div style="text-align: center">
<img src="https://img.alicdn.com/imgextra/i3/O1CN01h45Ifb1xRZhXXIXEC_!!6000000006440-1-tps-892-618.gif" style="width: 400px; display: block; margin-left: auto; margin-right: auto"/>
</div>
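
The difference between the two step-size strategies can be reproduced with a few lines of code. This is a minimal sketch on an assumed cost curve $J(a) = 0.25a^2$ (slope $0.5a$), starting from $a = 4$; the curve and the function names are illustrative only.

```python
def slope(a):
    """Slope of the hypothetical cost curve J(a) = 0.25 * a^2."""
    return 0.5 * a


def fixed_step_descent(a, step_size, steps):
    """Always move a fixed distance in the downhill direction."""
    path = [a]
    for _ in range(steps):
        g = slope(a)
        if g > 0:
            a -= step_size
        elif g < 0:
            a += step_size
        path.append(a)
    return path


def slope_step_descent(a, steps):
    """Move against the slope, using the slope itself as the step size."""
    path = [a]
    for _ in range(steps):
        a -= slope(a)
        path.append(a)
    return path


fixed_path = fixed_step_descent(4.0, step_size=1.5, steps=20)
slope_path = slope_step_descent(4.0, steps=20)

print("fixed step, last 4 positions:", fixed_path[-4:])
print("slope step, last position:", slope_path[-1])
```

With the fixed step the position keeps jumping back and forth across the lowest point, while the slope-based step shrinks automatically as the curve flattens.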

However, some Cost Function curves are very steep, and directly using the slope may still cause oscillation around the lowest point. To address this, you can multiply the slope by a coefficient to regulate the step size. This coefficient is called the learning rate.

The choice of learning rate is particularly important for training effectiveness and efficiency:

<div style="display: flex; justify-content: space-between; gap: 2px; padding: 15px; background:rgba(0,0,0,0)">
    <!-- Column 1 -->
    <div style="flex: 1; padding: 10px; border: 1px solid #ddd; border-radius: 5px">
        <p style="margin-top: 10px">An appropriate learning rate allows you to find suitable parameters in a relatively short time.</p>
        <img src="https://img.alicdn.com/imgextra/i3/O1CN01NrvVfj1sCqtKHLyia_!!6000000005731-2-tps-1680-1224.png" style="width: 100%; height: auto; border-radius: 3px"/>
    </div>
    <!-- Column 2 -->
    <div style="flex: 1; padding: 10px; border: 1px solid #ddd; border-radius: 5px">
        <p style="margin-bottom: 10px">An excessively low learning rate, while capable of finding suitable parameters, leads to greater time and resource consumption.</p>
        <img src="https://img.alicdn.com/imgextra/i1/O1CN015dbcz61MCn8LkN2Ta_!!6000000001399-2-tps-1728-1300.png" style="width: 100%; height: auto; border-radius: 3px"/>
    </div>
    <!-- Column 3 -->
    <div style="flex: 1; padding: 10px; border: 1px solid #ddd; border-radius: 5px">
        <p style="margin-bottom: 10px">An excessively high learning rate may cause you to skip the optimal solution, ultimately failing to find the lowest point.</p>
        <img src="https://img.alicdn.com/imgextra/i1/O1CN01l4leTB1LKI0BcVs16_!!6000000001280-2-tps-1658-1262.png" style="width: 100%; height: auto; border-radius: 3px"/>
    </div>
</div>

A smaller learning rate, although it will consume a lot of computational resources and time, actually helps you approach the lowest point more closely. In practical model training engineering, attempts are also made to dynamically adjust the learning rate. For example, in Model Studio's model fine-tuning feature, there is a [learning rate adjustment strategy](https://help.aliyun.com/zh/model-studio/user-guide/using-fine-tuning-on-console#7864d6a606ztg), which allows you to configure linear decay of the learning rate or decay according to a curve. Alibaba Cloud's PAI also provides an [AutoML](https://help.aliyun.com/zh/pai/user-guide/automl/) tool that can help you automatically find a more suitable learning rate.
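
To make the effect of the learning rate concrete, here is a small sketch that runs gradient descent on the mean squared error cost of the model $f(x) = ax$. The data (generated by $y = 2x$), the three learning rates, and the step count are assumptions picked so that the three behaviors described above show up clearly.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x, so the best parameter is a = 2


def cost_gradient(a):
    """Derivative of J(a) = mean((y - a*x)^2) with respect to a."""
    return sum(-2 * x * (y - a * x) for x, y in zip(xs, ys)) / len(xs)


def train(learning_rate, steps=50, a=0.0):
    """Plain gradient descent: a <- a - learning_rate * dJ/da."""
    for _ in range(steps):
        a -= learning_rate * cost_gradient(a)
    return a


for lr in (0.001, 0.05, 0.15):
    print(f"learning rate {lr}: a after 50 steps = {train(lr):.4f}")
```

On this toy problem the smallest rate is still far from $a = 2$ after 50 steps, the middle rate reaches it, and the largest rate overshoots further on every step and diverges.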

#### 2.1.4 More Parameters Used in Model Training Engineering

##### batch size

While searching for the lowest point of the Cost Function, each cycle of calculating the gradient (the slope in each direction) and then updating the model parameters based on that gradient is called a training step.

In previous introductions, each training step calculates the gradient at a certain point and then updates the parameters. You can also set the batch size to n, averaging the gradients based on n samples (mini-batch) to update the parameters.

A larger batch size can accelerate the training process, but it will also consume more resources, and an excessively large batch size may lead to issues such as reduced model generalization performance.

Choosing an appropriate batch size is a balancing act, depending on available hardware resources, training time, and desired model performance. In practice, experiments are often needed to determine the most suitable batch size for a specific task.
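
The relationship between batch size and training steps can be sketched as follows. This is a toy illustration under assumed values (eight samples following $y = 2x$, a learning rate of 0.01); `train_one_epoch` is a hypothetical helper, not a function from the swift framework.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Toy data following y = 2x; the parameter a starts at 0.
xs = [float(i) for i in range(1, 9)]
ys = [2.0 * x for x in xs]


def batch_gradient(a, batch):
    """Average gradient of the squared loss over one mini-batch."""
    return sum(-2 * x * (y - a * x) for x, y in batch) / len(batch)


def train_one_epoch(a, batch_size, learning_rate=0.01):
    """One pass over the shuffled training set; each mini-batch is one training step."""
    samples = list(zip(xs, ys))
    random.shuffle(samples)
    steps = 0
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        a -= learning_rate * batch_gradient(a, batch)
        steps += 1
    return a, steps


for batch_size in (1, 4, 8):
    a, steps = train_one_epoch(a=0.0, batch_size=batch_size)
    print(f"batch size {batch_size}: {steps} training steps per epoch, a = {a:.3f}")
```

A larger batch needs fewer parameter updates to cover the same data, but each update has to hold more samples in memory at once.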

##### eval steps

Because the training set is usually very large, people typically do not use the validation set for evaluation after a full iteration over the training set. Instead, they choose to evaluate using the validation set after a certain number of training steps. This interval is usually controlled by the eval_steps parameter.

##### epoch

A complete iteration over the training set is called an epoch. In actual training, you cannot guarantee finding the optimal solution (lowest point) of the Cost Function within one epoch. Therefore, many training frameworks support configuring the number of training epochs, such as the num_train_epochs parameter provided in the swift training framework.

If the number of epochs is too small, training may end before the optimal model parameters are found. If it is too large, training takes excessively long and wastes resources.

A common method for finding a suitable epoch is early stopping: before starting training, you do not preset an epoch value (or set a larger value). During training, you periodically evaluate the model's performance using the validation set. When the model's performance on the validation set no longer improves (or starts to decline), training is automatically stopped.

Of course, early stopping is not the only solution. There are many other methods in the industry to determine a suitable epoch value, such as dynamically adjusting the learning rate based on changes in validation set loss to indirectly affect the number of training epochs.
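
The early stopping idea, combined with periodic evaluation every `eval_steps` training steps, can be sketched like this. Everything here is a simplified assumption for illustration: the tiny noisy datasets, the `patience` and `min_delta` parameters, and the function names are hypothetical and do not correspond to the options of any particular training framework.

```python
# Small noisy datasets roughly following y = 2x (made-up values).
train_set = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
val_set = [(1.5, 3.0), (2.5, 5.1), (3.5, 6.9)]


def mse(a, data):
    """Mean squared error of the model f(x) = a*x on a dataset."""
    return sum((y - a * x) ** 2 for x, y in data) / len(data)


def gradient(a, data):
    """Derivative of the mean squared error with respect to a."""
    return sum(-2 * x * (y - a * x) for x, y in data) / len(data)


def train_with_early_stopping(max_steps=1000, eval_steps=5, patience=3,
                              learning_rate=0.01, min_delta=1e-6):
    a = 0.0
    best_val, best_a, bad_evals = float("inf"), a, 0
    for step in range(1, max_steps + 1):
        a -= learning_rate * gradient(a, train_set)      # one training step
        if step % eval_steps == 0:                        # periodic evaluation
            val_loss = mse(a, val_set)
            if val_loss < best_val - min_delta:           # still improving
                best_val, best_a, bad_evals = val_loss, a, 0
            else:                                         # no real improvement
                bad_evals += 1
                if bad_evals >= patience:
                    return best_a, best_val, step         # stop early
    return best_a, best_val, max_steps


best_a, best_val, stopped_step = train_with_early_stopping()
print(f"stopped at step {stopped_step}, a = {best_a:.4f}, validation loss = {best_val:.5f}")
```

The run keeps the parameter value from the best evaluation and stops once several consecutive evaluations bring no meaningful improvement, well before the step limit is reached.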

#### 2.1.5 Neural Network - Universal Complex Function Approximator

**Problems faced in machine learning:**

In text generation tasks, the input $x$ and output $y$ generally have very high dimensions, making it impossible to directly discern the underlying patterns. What should you do?

Smart mathematicians found a **universal function approximator — neural network (multi-layer)**, which has become the foundation of current complex machine learning tasks.

<div style="text-align: center">
<img src="https://img.alicdn.com/imgextra/i1/O1CN01QRD5MH1rwMdJHBzxi_!!6000000005695-2-tps-1080-533.png" style="width: 400px; display: block; margin-left: auto; margin-right: auto"/>
</div>

One layer of a neural network is generally expressed as $Y=σ(W⋅X)$, where the uppercase input $X$ and output $Y$ indicate they are multi-dimensional, $σ$ is the activation function, and $W$ represents the parameters of the assumed function $f$. A k-layer neural network can be expressed as $Y=σ(W_k ⋯ σ(W_2 ⋅σ(W_1⋅X)))$.

The activation function is a key component of neural networks: it introduces non-linear transformations and determines whether a neuron is activated and passes information on. For example, ReLU, one of the most commonly used activation functions, can be written as:

**$ReLU(input) = \max(0, input)= \begin{cases} input & \text{if } input > 0 \\ 0 & \text{if } input \le 0 \end{cases}$**

When $input \le 0$, the neuron is not activated; when $input > 0$, the neuron is activated and begins transmitting information to the output.

Expanding one layer of a neural network can be written as follows (assuming $X$ is a $3×2$ dimensional matrix and $Y$ is a $2×2$ dimensional matrix):

$σ(W_{2×3}⋅X_{3×2})= σ(\left[ \begin{matrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \end{matrix} \right]×\left[ \begin{matrix} x_{1,1}& x_{1,2}\\ x_{2,1}& x_{2,2} \\ x_{3,1}& x_{3,2} \end{matrix} \right])$

$= σ(\left[\begin{matrix}
w_{1,1}×x_{1,1}+w_{1,2}×x_{2,1}+w_{1,3}×x_{3,1}&
w_{1,1}×x_{1,2}+w_{1,2}×x_{2,2}+w_{1,3}×x_{3,2} \\
w_{2,1}×x_{1,1}+w_{2,2}×x_{2,1}+w_{2,3}×x_{3,1}&
w_{2,1}×x_{1,2}+w_{2,2}×x_{2,2}+w_{2,3}×x_{3,2} \end{matrix} \right])$

$= \left[ \begin{matrix} max(0, \sum\limits_{k=1}^{3}w_{1,k}×x_{k,1})& max(0, \sum\limits_{k=1}^{3}w_{1,k}×x_{k,2})\\ max(0, \sum\limits_{k=1}^{3}w_{2,k}×x_{k,1})& max(0, \sum\limits_{k=1}^{3}w_{2,k}×x_{k,2}) \end{matrix} \right]= \left[ \begin{matrix} y_{1,1}& y_{1,2}\\ y_{2,1}& y_{2,2} \end{matrix} \right]=Y_{2×2}$
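
The same single-layer computation can be checked numerically. The sketch below uses NumPy with made-up values for $W$ and $X$; only the shapes ($2×3$ times $3×2$) follow the expansion above.

```python
import numpy as np


def relu(z):
    """ReLU activation: keep positive values, replace the rest with 0."""
    return np.maximum(0, z)


# Hypothetical values; only the shapes follow the text.
W = np.array([[1.0, -2.0, 0.5],
              [0.0, 1.0, -1.0]])   # 2 x 3
X = np.array([[1.0, 2.0],
              [3.0, 0.0],
              [2.0, 4.0]])         # 3 x 2

Y = relu(W @ X)                    # 2 x 2

# The same entry computed by hand: y_{1,1} = max(0, sum_k w_{1,k} * x_{k,1})
y_11 = max(0.0, sum(W[0, k] * X[k, 0] for k in range(3)))

print(Y)
print("shape:", Y.shape, "| y_11 by hand:", y_11)
```

Entries whose weighted sum is negative come out as 0, which is the "neuron not activated" case.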

Fortunately, the gradient descent method remains effective on high-dimensional, complex functions.

<div style="text-align: center">
<img src="https://img.alicdn.com/imgextra/i3/O1CN011caxP31GiUrEv1aGH_!!6000000000656-2-tps-847-779.png" style="width: 400px; display: block; margin-left: auto; margin-right: auto"/>
</div>

Now you have the ace combination:

**A tool capable of approximating any complex function — neural network + a method capable of fitting data patterns and learning function parameters — gradient descent method**

### 2.2 Efficient Fine-Tuning Techniques

#### 2.2.1 Pre-training and Fine-Tuning

From previous learning, you've already understood the core of model training: finding the optimal combination of parameters.

The model you initially downloaded is a pre-trained set of parameters — the result of extensive training on large-scale data.

Fine-tuning refers to further adjusting these parameters to better suit your specific task (e.g., solving math problems or answering questions in a specialized domain).

Let’s take the `qwen2.5-1.5b-instruct` model as an example to understand the time and hardware requirements for full training from scratch.

---

#### GPU Memory Requirements

*   Memory occupied by 1.5 billion parameters (assuming FP32 precision, with each parameter taking 4 bytes):  
    $ \frac{1.5 \times 10^9 \times 4}{2^{30}} \approx 5.59 \text{ GB} $

*   In practice, training a model typically requires **7–8 times** the memory of its parameter size due to gradients, optimizer states, and intermediate activations. This brings the total GPU memory requirement to around **45 GB**, which exceeds most consumer-grade GPUs and even many cloud-based experimental environments.
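
The memory arithmetic above can be reproduced in a few lines. This is a back-of-the-envelope sketch: the 7 to 8 times multiplier is the rule of thumb quoted in the text, not a measured value, and `parameter_memory_gib` is a hypothetical helper name.

```python
def parameter_memory_gib(num_params, bytes_per_param=4):
    """Memory needed just to hold the weights, in GiB (1 GiB = 2**30 bytes)."""
    return num_params * bytes_per_param / 2 ** 30


weights_gib = parameter_memory_gib(1.5e9)      # FP32: 4 bytes per parameter
low, high = 7 * weights_gib, 8 * weights_gib   # rule-of-thumb multiplier from the text

print(f"weights only: {weights_gib:.2f} GiB")
print(f"training estimate: {low:.1f} to {high:.1f} GiB")
```

Passing `bytes_per_param=2` gives the corresponding figure for 16-bit weights.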

---

#### Training Time Estimation

*   Example calculation:  
    - Total training tokens = **200 billion**, or about 250 thousand copies of Shakespeare's complete works
    - Batch size (using 8 GPUs in parallel) = **2,000 tokens per batch**  
    - Throughput = **150 tokens/GPU/sec × 8 GPUs = 1,200 tokens/sec**

*   Estimated training time =  
    $$
    \frac{\text{Total Tokens}}{\text{Batch Size} \times \text{Tokens per Second} \times 86400} \approx 10 \text{ days}
    $$

*   Real-world considerations:  
    Include data preprocessing, checkpoint saving, and distributed communication overhead. Actual training time may increase by **20–50%**. If the dataset grows to **1 trillion tokens**, the training duration could extend to **several months**.

---

#### Training Cost Overview

*   For short-term training (e.g., 10 days), renting cloud GPU instances on a pay-as-you-go basis is often more cost-effective than purchasing dedicated hardware upfront.

*   Training cost formula:  
    $$
    \text{Training Cost} = \text{GPU hourly rate} \times \text{Training time (in hours)}
    $$

---


In summary, training cost can be reduced by **lowering the server unit price** and **shortening the training time**: **reducing memory requirements** lowers the server unit price, and reducing the **total training data volume** shortens the training time.

<br/>

In the actual model training process, there is also a challenge: **the high cost of obtaining labeled data**, **especially for specific tasks** (e.g., medical image analysis or niche language processing). You can try step-by-step training of the model through "pre-training" and "fine-tuning", where:

*   **Pre-training**: Training the model on a large-scale **general dataset** so that it can learn broad foundational knowledge or feature representations. This knowledge is usually general and not aimed at any specific task. Pre-training is not task-specific but provides a powerful initial model for various downstream tasks. Typical pre-trained models: Qwen2.5-Max, DeepSeek-V3, GPT-4, etc.
    
*   **Fine-tuning**: Further training the model using a **small-scale dataset** specific to a task based on the pre-trained model. The goal is to make the model adapt to specific downstream tasks (e.g., medical, legal, and other professional domain needs).
    

The table below shows the main differences between pre-training and fine-tuning:

<div style="width: 20%">
    
|  **Feature**  |  **Pre-training**  |  **Fine-tuning**  |
| --- | --- | --- |
|  Objective  |  Learning general features  |  Adapting to specific tasks  |
|  Data  |  Large-scale general data  |  Small-scale task-related data  |
|  Training method  |  Self-supervised/Unsupervised  |  Supervised  |
|  Parameter updates  |  All parameters trainable  |  Partial or all parameters trainable  |
|  Application scenarios  |  Base model construction  |  Specific task optimization  |

</div>

It is worth mentioning that **pre-training generally learns through self-supervised learning**, with data coming from massive texts on the internet (e.g., Wikipedia, books, web pages), allowing the model to find patterns or "guess answers" on its own. This learning method does not require manual annotation, saving a lot of labor costs, making it naturally suitable for learning from massive data.

On the other hand, **fine-tuning is done through supervised learning**, requiring small-scale annotated data for specific tasks (e.g., annotated reviews for sentiment classification, annotated medical texts), and directly teaching the model to complete tasks using annotated data. Due to the high cost of manual annotation, this learning method is difficult to scale to massive data, thus being more suitable for model training with clear scenario goals, typically requiring only a few thousand to tens of thousands of samples.

Therefore, you can quickly and cost-effectively build your large model application in the following ways:

Step 1: Directly choose a pre-trained model (e.g., Qwen, DeepSeek, GPT), which can save the comprehensive cost of training a model from scratch.

Step 2: Fine-tune the model for your actual scenario. You usually only need to build a few thousand annotated samples that fit the scenario. Because the total number of training tokens is greatly reduced, training time is shortened, which further reduces the training cost.

Fine-tuning can shorten training time, but can fine-tuning the model also reduce memory requirements?

The number of parameters being trained is the main factor affecting memory requirements. Depending on how many parameters are adjusted, fine-tuning can be divided into **full-parameter fine-tuning** and **parameter-efficient fine-tuning**.

**Full-parameter fine-tuning (Full Fine Tuning)** is a model optimization method that fine-tunes all parameters based on the pre-trained model, meaning that in the above model structure, any parameter will be adjusted. This method avoids consuming the large amount of computational resources required to retrain all parameters of the model from scratch while avoiding performance degradation due to some parameters not being fine-tuned. However, large model training costs are high, requiring substantial computational resources and large amounts of data. Even with full-parameter fine-tuning, high training costs are often still needed.

**Parameter-efficient fine-tuning (PEFT)** significantly reduces the computational cost of fine-tuning large models by adjusting only a small number of parameters while keeping performance close to full-parameter fine-tuning. Typical methods include Adapter Tuning, Prompt Tuning, and LoRA. Among them, LoRA only needs to train small low-rank matrices (typically 0.1%-1% of the original model's parameter count) and has become the preferred solution for resource-constrained scenarios. The following focuses on how LoRA achieves parameter-efficient fine-tuning with an extremely low parameter count.


#### 2.2.2 LoRA Fine-tuning

LoRA (Low-Rank Adaptation) fine-tuning is currently the most commonly used method for model adaptation. It does not rely on the internal architecture of the model but instead abstracts and decomposes the parameters that need updating during fine-tuning into two much smaller low-rank matrices $A_{d \times r}$ and $B_{r \times d}$. The original model weights remain frozen, i.e.,  
$$
W^{fine-tuned}_{d \times d} = A_{d \times r} \cdot B_{r \times d} + W^{pre-trained}_{d \times d}
$$

If you're unfamiliar with the concept of low-rank decomposition, let's revisit a simple neural network formulation. Assume the input vector $X$ has dimension 4 and the output vector $Y$ has dimension 5. Then, the weight matrix $W$ is of size $5 \times 4$, denoted as $W \in \mathbb{R}^{5 \times 4}$, containing a total of 20 parameters.

A single-layer neural network can be expressed as:  
$$
Y_{5 \times 1} = \sigma(W_{5 \times 4} \cdot X_{4 \times 1})
$$
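The shapes in this formula can be checked with a few lines of plain Python. This is a minimal sketch: the logistic function is only one possible choice for $\sigma$, and the weight and input values are made up for illustration.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic activation, used here as one possible choice for sigma."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(W: list[list[float]], x: list[float]) -> list[float]:
    """Compute Y = sigma(W . X) for a weight matrix W and an input vector x."""
    assert all(len(row) == len(x) for row in W), "each row of W must match the input dimension"
    return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W]

# Illustrative values only: a 5 x 4 weight matrix and a 4-dimensional input.
W = [[0.1 * (i + j) for j in range(4)] for i in range(5)]
x = [1.0, 0.0, -1.0, 2.0]
y = forward(W, x)

print("parameters in W:", len(W) * len(W[0]))
print("output dimension:", len(y))
```

A $5 \times 4$ weight matrix therefore holds 20 parameters and maps a 4-dimensional input to a 5-dimensional output.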

The rank of a matrix intuitively represents its effective information content. For example, although the following matrix has 2 rows and 3 columns, all rows are linearly dependent — one row can represent the others — so its rank is 1:
$$
\text{rank}\left( 
\begin{bmatrix}
1 & 2 & 3 \\
2 & 4 & 6
\end{bmatrix}
\right) = 1
$$

In model fine-tuning, it can be assumed that most of the useful information (a high-rank update) has already been learned during pre-training, while the additional effective information introduced by fine-tuning is small (a low-rank update). Since a $5 \times 4$ matrix has rank at most 4, this can be written mathematically as:

$$
W_{5 \times 4}^{pre-trained} - W_{5 \times 4}^{initial} = \Delta W_{5 \times 4}^{pre-trained}, \quad \text{rank}(\Delta W_{5 \times 4}^{pre-trained}) = 4
$$
$$
W_{5 \times 4}^{fine-tuned} - W_{5 \times 4}^{pre-trained} = \Delta W_{5 \times 4}^{fine-tuning}, \quad \text{rank}(\Delta W_{5 \times 4}^{fine-tuning}) \leq 2
$$

Since a low-rank matrix carries largely redundant information, it can be decomposed exactly into the product of two much smaller matrices. Assuming $\text{rank}(\Delta W_{5 \times 4}^{fine-tuning}) = 1$, we can write:

$$
\Delta W_{5 \times 4}^{fine-tuning} =
\begin{bmatrix}
1 & 0 & 2 & -1 \\
2 & 0 & 4 & -2 \\
3 & 0 & 6 & -3 \\
4 & 0 & 8 & -4 \\
5 & 0 & 10 & -5
\end{bmatrix}_{5 \times 4}
=
\begin{bmatrix}
1 \\
2 \\
3 \\
4 \\
5
\end{bmatrix}_{5 \times 1}
\times
\begin{bmatrix}
1 & 0 & 2 & -1
\end{bmatrix}_{1 \times 4}
$$
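The decomposition above can be verified numerically. The sketch below uses only the Python standard library; `outer` and `matrix_rank` are helper names introduced here for illustration, and the rank is computed by exact Gaussian elimination, which is only practical for small matrices like this one.

```python
from fractions import Fraction

def outer(u, v):
    """Outer product of a column vector u and a row vector v."""
    return [[ui * vj for vj in v] for ui in u]

def matrix_rank(M):
    """Rank via Gaussian elimination with exact fractions."""
    rows = [[Fraction(x) for x in row] for row in M]
    rank = 0
    n_cols = len(rows[0]) if rows else 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        # Eliminate this column from all rows below the pivot row.
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank

u = [1, 2, 3, 4, 5]    # 5 x 1 factor
v = [1, 0, 2, -1]      # 1 x 4 factor
delta_w = outer(u, v)  # 5 x 4 update with 20 entries

print("last row of the update:", delta_w[4])
print("rank of the update:", matrix_rank(delta_w))
print("values stored by the two factors:", len(u) + len(v), "instead of", len(u) * len(v))
```

The two factors hold 9 numbers yet reproduce all 20 entries of the update, which is the saving LoRA exploits at a much larger scale.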

To illustrate this further, consider the base model `qwen2.5-1.5b-instruct`, where we assume $r = 8$ and $d = 1024$. Below is a comparison of parameter counts:

$$
W^{fine-tuned}_{d \times d} = A_{d \times r} \cdot B_{r \times d} + W^{pre-trained}_{d \times d}
$$

| **Method** | **Parameter Calculation Formula** | **Number of Parameters** | **Savings Ratio** |
| --- | --- | --- | --- |
| Full-parameter fine-tuning | $W_{d \times d}$, $1024 \times 1024$ | 1,048,576 | $0\%$ |
| LoRA fine-tuning | $A_{d \times r}$ and $B_{r \times d}$, $1024 \times 8 + 8 \times 1024$ | 16,384 | $98.44\%$ |
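The figures in the table follow directly from the two formulas. As a small sketch (the function name `lora_param_counts` is introduced here for illustration), the counts for a single $d \times d$ weight matrix can be reproduced as follows:

```python
def lora_param_counts(d: int, r: int) -> tuple[int, int, float]:
    """Return (full parameters, LoRA parameters, savings ratio) for one d x d matrix."""
    full = d * d          # every entry of W is trainable
    lora = d * r + r * d  # only A (d x r) and B (r x d) are trainable
    savings = 1 - lora / full
    return full, lora, savings

full, lora, savings = lora_param_counts(d=1024, r=8)
print(f"full: {full:,}  lora: {lora:,}  savings: {savings:.2%}")
```

Because the LoRA count grows linearly with $r$ while the full count grows with $d^2$, doubling $r$ doubles the number of trainable parameters but leaves the savings ratio high as long as $r \ll d$.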

During inference, the matrices $A_{d \times r}$, $B_{r \times d}$, and $W^{pre-trained}_{d \times d}$ can be merged to reconstruct $W^{fine-tuned}_{d \times d}$ either in advance or dynamically.
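Both options give the same output, because $(W^{pre-trained} + A \cdot B) \cdot X = W^{pre-trained} \cdot X + A \cdot (B \cdot X)$. The sketch below checks this with tiny integer matrices ($d = 3$, $r = 1$); the values are made up for illustration and are not real model weights.

```python
def matmul(A, B):
    """Plain-Python matrix product for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

W_pre = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]  # frozen d x d weights
A = [[1], [2], [0]]                         # d x r
B = [[0, 1, -1]]                            # r x d
x = [[2], [1], [4]]                         # d x 1 input

# Option 1: merge in advance, then run a single matrix product.
W_merged = add(W_pre, matmul(A, B))
y_merged = matmul(W_merged, x)

# Option 2: keep the adapter separate and add its contribution at run time.
y_dynamic = add(matmul(W_pre, x), matmul(A, matmul(B, x)))

print("outputs match:", y_merged == y_dynamic)
```

Merging in advance removes the extra products at inference time, while dynamic loading keeps the small adapter separate from the base weights.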

<div style="text-align: center;">
<a href="https://img.alicdn.com/imgextra/i3/O1CN01NtGavS1TTvIIeZxO1_!!6000000002384-2-tps-804-712.png" target="_blank">
<img src="https://img.alicdn.com/imgextra/i3/O1CN01NtGavS1TTvIIeZxO1_!!6000000002384-2-tps-804-712.png" style="width: 600px;background:white;display: block; margin-left: auto; margin-right: auto"/>
</a>
<br>Image source: the LoRA paper, "LoRA: Low-Rank Adaptation of Large Language Models"
</div>

When using LoRA for fine-tuning, the main tunable hyperparameter is the assumed rank $r$. A larger $r$ allows the model to capture more complex feature changes but makes training harder, requiring more memory and more training epochs.

Empirically, the value of $r$ is closely related to the amount of training data:

- **For small datasets (1k–10k samples):** It is recommended to use $r \leq 16$ to prevent overfitting and excessive training time.
- **For large datasets (100k+ samples):** Try $r \geq 32$ to better explore underlying patterns in the data.

#### 2.2.3 LoRA Fine-tuning Effectiveness

The authors of LoRA compared various fine-tuning methods across two datasets (the x-axis shows the number of trainable parameters, and the y-axis indicates training effectiveness). As shown below, LoRA provides the best cost-effectiveness trade-off.

<div style="text-align: center;">
<img src="https://img.alicdn.com/imgextra/i1/O1CN01RGquUv1ZlDuoik8zU_!!6000000003234-2-tps-1944-662.png" style="width: 700px;background:white;display: block; margin-left: auto; margin-right: auto"/>
</div>

It is clear that not all methods benefit from having more trainable parameters — **more parameters do not necessarily lead to better performance.** However, **the LoRA method demonstrates superior scalability and task performance.**

## 3. Fine-Tuning Practice

### 3.1 Model Training Status and Metrics

Training a model is very similar to the human learning and exam process.

A model is tested on three sets of questions, and the two loss metrics these tests produce indicate the state of training.

Three sets of questions:

*   **Training set**: The practice workbook with detailed answer explanations. The model will repeatedly practice and generate **training loss** based on the loss function. The smaller the training loss, the better the model performs on the provided practice workbook. Combined with the gradient descent method discussed in section 2.1 on how models learn, the model updates its parameters based on the training loss.
    
*   **Validation set**: Simulated exam questions. After the model has learned for a period of time, it will be tested once and generate **validation loss** based on the loss function. Validation loss is used to evaluate the effectiveness of model training. The smaller the validation loss, the better the model performs in the simulated exam.
    
*   **Test set**: Real exam questions. The accuracy of the model on the test set is used to evaluate the final performance of the model.

The three states of model training:

*   **Unchanged or increasing training loss**: This indicates **training failure**. You can think of it as the model not learning anything from the training set (practice workbook), indicating that there is an issue with the model's learning method.
    
*   **Both training loss and validation loss are decreasing**: This indicates that the model is **underfitting**. You can imagine that the model is making progress on the training set (practice workbook) and its performance on the validation set (simulated exam) is also improving, but there is still more room for improvement. At this point, you should let the model continue learning.
    
*   **Decreasing training loss but increasing validation loss**: This indicates that the model is **overfitting**. You can think of it as the model simply memorizing the training set (practice workbook). When taking the exam, it struggles with unseen questions. In this scenario, methods to suppress the model’s tendency to memorize should be applied, such as providing it with 20 more workbooks so that it cannot remember all the questions and is forced to learn the underlying patterns in the questions.
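The three states above can be expressed as a simple rule on the recorded loss curves. This is a rough sketch rather than a rigorous criterion: the function `diagnose` and its threshold `tol` are hypothetical, and real curves are noisy enough that you would normally inspect the plots as well.

```python
def diagnose(train_loss: list[float], val_loss: list[float], tol: float = 1e-3) -> str:
    """Classify a run into one of the three states described above.

    `tol` is a hypothetical threshold below which a change counts as flat.
    """
    train_delta = train_loss[-1] - train_loss[0]
    val_rise = val_loss[-1] - min(val_loss)
    if train_delta >= -tol:
        return "training failure"  # training loss is flat or rising
    if val_rise > tol:
        return "overfitting"       # validation loss has risen from its minimum
    return "underfitting"          # both losses are still falling

# Made-up loss values for illustration.
print(diagnose([2.0, 2.1, 2.3], [2.0, 2.2, 2.4]))
print(diagnose([2.0, 1.5, 1.2], [2.1, 1.7, 1.5]))
print(diagnose([2.0, 1.0, 0.3], [2.1, 1.4, 1.9]))
```

The rule only covers the three states listed here; a run in which both losses have fallen and then levelled off needs a separate judgment.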

### 3.2 Baseline Model Examination

Before starting the model fine-tuning, let's first take a look at how the baseline model performs on the test set.  



In [7]:
import json
# Latex and display are used below to render the model response
from IPython.display import Latex, Markdown, display

sum, score = 0, 0
for line in open("./resources/2_7/test.jsonl"):
    # Read math questions from the test set
    math_question = json.loads(line)
    query = math_question["messages"][1]["content"]
    # Inference using the baseline model
    response, _ = inference(model, template, query)
    # Get the correct answer
    ans = math_question["messages"][2]["content"]
    pos = ans.find("ans")
    end_pos = ans[pos:].find('}}')
    ans = ans[pos - 2: end_pos + pos + 2]
    # Format output
    print(("========================================================================================"))
    print(query.split("#Math Problem#\n")[1])
    print("The correct answer is: " + ans)
    print("-----------Model Response----------------")
    display(Latex(response))
    print("-----------End of Response----------------")
    # Calculate model score
    if ans in response or ans[6:-2] in response:
        score += 1
        print("Model answered correctly")
    else: print("Model answered incorrectly")
    sum += 1
# Summary
display(Markdown("Model scored: **" + str(int(100*score/sum)) + "** points in the exam"))

Xiaoli is reading a 240-page storybook. On the first day, she read (1/8) of the entire book, and on the second day, she read (1/5) of the entire book. How many pages did she read in total over the two days?
The correct answer is: {{ans:78}}
-----------Model Response----------------


<IPython.core.display.Latex object>

-----------End of Response----------------
Model answered incorrectly
A rectangular dining table is 150 centimeters long, and its width is 100 centimeters shorter than its length. What is the perimeter of this rectangular dining table in centimeters?
The correct answer is: {{ans:400}}
-----------Model Response----------------


<IPython.core.display.Latex object>

-----------End of Response----------------
Model answered incorrectly
The physical education room purchased a batch of small balls. When divided into groups of 3, 4, or 5, there were no remainders each time. What is the minimum number of balls in this batch?
The correct answer is: {{ans:60}}
-----------Model Response----------------


<IPython.core.display.Latex object>

-----------End of Response----------------
Model answered incorrectly
Grandpa is 62 years old this year, and Dongdong is 8 years old. How many times older will Grandpa be than Dongdong next year?
The correct answer is: {{ans:7}}
-----------Model Response----------------


<IPython.core.display.Latex object>

-----------End of Response----------------
Model answered incorrectly
Ms. Wang has 400 yuan. She bought 15 tubes of badminton shuttlecocks, each costing 25 yuan. How much money does she have left?
The correct answer is: {{ans:25}}
-----------Model Response----------------


<IPython.core.display.Latex object>

-----------End of Response----------------
Model answered correctly
The ratio of the number of chickens to ducks is 5:3. If there are 25 chickens, how many ducks are there?
The correct answer is: {{ans:15}}
-----------Model Response----------------


<IPython.core.display.Latex object>

-----------End of Response----------------
Model answered incorrectly
It is known that a No. 4 bus departs every 10 minutes, and a No. 6 bus departs every 15 minutes. After both buses depart simultaneously, they will depart simultaneously again after at least ___ minutes. How many?
The correct answer is: {{ans:30}}
-----------Model Response----------------


<IPython.core.display.Latex object>

-----------End of Response----------------
Model answered correctly
A road construction project started on May 9th and was completed on June 10th. How many days did the construction take?
The correct answer is: {{ans:33}}
-----------Model Response----------------


<IPython.core.display.Latex object>

-----------End of Response----------------
Model answered incorrectly
A football costs 32 yuan, and a volleyball costs 18 yuan. The school bought 8 footballs and 6 volleyballs. How much is the total cost?
The correct answer is: {{ans:364}}
-----------Model Response----------------


<IPython.core.display.Latex object>

-----------End of Response----------------
Model answered incorrectly
The sports field track is 400 meters per lap. Mingming wants to run 1 kilometer. He has already run one lap. How many more meters does he need to run to reach one kilometer?
The correct answer is: {{ans:600}}
-----------Model Response----------------


<IPython.core.display.Latex object>

-----------End of Response----------------
Model answered correctly


Model scored: **30** points in the exam

The baseline model often abandons reasoning midway during the exam and struggles to provide correct answers. This performance suggests that the question difficulty exceeds its processing capability, and it also explains why prompt engineering is ineffective here: the model itself lacks the necessary problem-solving ability. In this situation, fine-tuning the model is the most direct way to improve it.  
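The scoring step in the cell above counts a response as correct when the bare number appears anywhere in it, so a reference answer of `25` would also match a response containing `125`. As a stricter alternative (a sketch, not what the notebook does), the answer can be read from the `{{ans:...}}` marker with a regular expression. The helper names `extract_answer` and `grade` are introduced here for illustration, and the sketch assumes the model is expected to answer in the same marker format.

```python
import re

# Matches the {{ans:...}} marker used in the reference answers.
ANSWER_PATTERN = re.compile(r"\{\{ans:(.*?)\}\}")

def extract_answer(text: str):
    """Return the content of the last {{ans:...}} marker in `text`, or None if absent."""
    matches = ANSWER_PATTERN.findall(text)
    return matches[-1].strip() if matches else None

def grade(reference: str, response: str) -> bool:
    """True only if the response carries the same marked answer as the reference."""
    expected = extract_answer(reference)
    return expected is not None and extract_answer(response) == expected

reference = "400 - 15 * 25 = 25, so the answer is {{ans:25}}"
print(grade(reference, "She has 25 yuan left. {{ans:25}}"))
print(grade(reference, "The result is 125"))
```

With this rule, a response that gives the right number in the wrong format is scored as incorrect, so the resulting scores are not directly comparable with those produced by the cell above.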



### 3.3 Model Fine-tuning

Here, we use the [ms-swift](https://github.com/modelscope/ms-swift/tree/main) (Modelscope Scalable lightWeight Infrastructure for Fine-Tuning) framework, an open-source framework specifically developed by Alibaba's ModelScope community for model training. This framework supports the training (pre-training, fine-tuning, alignment), inference, evaluation, and deployment of over 350 large language models (LLMs) and more than 90 multi-modal LLMs (MLLMs).

Moreover, the ms-swift framework is very convenient to use. Each time it calculates the validation loss (evaluation loss), the framework automatically saves the current model parameters (model_checkpoint). At the end of training, it also records the parameters with the smallest validation loss, which correspond to the best_model_checkpoint in the figure below.

<div style="text-align: center;">
<img src="https://img.alicdn.com/imgextra/i3/O1CN0150XsFO1xM4z7CUMNr_!!6000000006428-2-tps-2288-136.png" style="width: 70%;display: block; margin-left: auto; margin-right: auto"/>
</div>
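The idea behind the best checkpoint can be sketched in a few lines: among all evaluation points, keep the one with the lowest validation loss. This is only an illustration of the selection rule, not ms-swift's internal implementation; the function name and the loss values are made up.

```python
def pick_best_checkpoint(eval_log: list[tuple[int, float]]) -> int:
    """Return the training step whose recorded validation loss is lowest."""
    step, _ = min(eval_log, key=lambda entry: entry[1])
    return step

# Made-up (step, validation loss) pairs for illustration only.
eval_log = [(20, 0.92), (40, 0.61), (60, 0.48), (80, 0.53), (100, 0.66)]
print(f"best checkpoint: checkpoint-{pick_best_checkpoint(eval_log)}")
```

In this made-up log the validation loss rises again after step 60, so the last checkpoint is not the best one.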

In the following experiments, we will focus on adjusting three parameters: the learning rate (learning_rate), the LoRA rank (lora_rank), and the number of training epochs over the dataset (num_train_epochs). We will also replace the dataset to demonstrate how to perform LoRA fine-tuning. Other parameters are adjusted only to make the experimental results easier to present, such as increasing the batch size (batch_size) to shorten training time, and you do not need to pay much attention to them.



#### 3.3.1 First Experiment (Takes 1 minute)

In the initial experiment, it is recommended that you first fine-tune with the following parameter settings, using a dataset of 100 problem solutions generated by DeepSeek-R1. The subsequent experiments then improve the training result by optimizing these parameters:

| Parameter | Parameter Value |
| --- | --- |
| Learning Rate (learning_rate) | 0.1 |
| LoRA Rank (lora_rank) | 4 |
| Number of Training Epochs (num_train_epochs) | 1 |
| Dataset Location (dataset) | current directory/resources/2_7/train_100.jsonl |
| You can adjust all parameters freely, but due to display effects and memory constraints, there are the following limitations: | batch_size <= 16 (memory constraint) <br>max_length <= 512 (maximum length of each training data, memory constraint) <br>lora_rank <= 64 (LoRA rank, memory constraint) <br>eval_step <= 20 (for convenience of display) |

Start the experiment:<br/>
The fine-tuning module of the ms-swift framework uses LoRA fine-tuning by default, so there is no need to explicitly declare the fine-tuning method in the experiment.<br/>
In addition, the framework gradually lowers the actual learning rate as training progresses, so that later updates take smaller steps and the model is less likely to keep overshooting the optimal solution.
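To make the idea of lowering the learning rate during training concrete, here is a generic linear-decay schedule with optional warm-up. It is a sketch under assumptions: the function `linear_decay_lr` is hypothetical, and the scheduler that ms-swift actually applies may use a different shape.

```python
def linear_decay_lr(step: int, total_steps: int, base_lr: float, warmup_steps: int = 0) -> float:
    """Learning rate at `step`: optional linear warm-up, then linear decay to zero."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / max(total_steps - warmup_steps, 1)

# Illustrative run of 7 steps starting from a base learning rate of 5e-5.
schedule = [linear_decay_lr(s, total_steps=7, base_lr=5e-5) for s in range(7)]
for step, lr in enumerate(schedule):
    print(step, f"{lr:.2e}")
```

The configured `learning_rate` is therefore best read as the peak value; the rate used in later steps is smaller.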



In [10]:
%env CUDA_VISIBLE_DEVICES=0
%env LOG_LEVEL=INFO
!swift sft \
--learning_rate '0.1' \
--lora_rank 4 \
--num_train_epochs 1 \
--dataset './resources/2_7/train_100.jsonl' \
--batch_size '8' \
--max_length 512 \
--eval_step 1 \
--model_type 'qwen2_5-1_5b-instruct' \
--model_id_or_path './model'

env: CUDA_VISIBLE_DEVICES=0
env: LOG_LEVEL=INFO
run sh: `/usr/local/bin/python /usr/local/lib/python3.10/site-packages/swift/cli/sft.py --learning_rate 0.1 --lora_rank 4 --num_train_epochs 1 --dataset ./resources/2_7/train_100.jsonl --batch_size 8 --max_length 512 --eval_step 1 --model_type qwen2_5-1_5b-instruct --model_id_or_path ./model`
[INFO:swift] Successfully registered `/usr/local/lib/python3.10/site-packages/swift/llm/data/dataset_info.json`
[INFO:swift] Start time of running main: 2025-07-15 13:39:51.537119
[INFO:swift] Setting template_type: qwen2_5
[INFO:swift] Setting args.lazy_tokenize: False
[INFO:swift] Setting args.dataloader_num_workers: 1
[INFO:swift] output_dir: /mnt/workspace/alibabacloud_acp_learning/ACP/p2_Build LLM Q&A System/output/qwen2_5-1_5b-instruct/v2-20250715-133951
[INFO:swift] args: SftArguments(model_type='qwen2_5-1_5b-instruct', model_id_or_path='/mnt/workspace/alibabacloud_acp_learning/ACP/p2_Build LLM Q&A System/model', model_revision='master', full_

| Training loss image | Evaluation loss image |
| --- | --- |
|<img src="https://img.alicdn.com/imgextra/i2/O1CN0122CqML1xiykiTglmo_!!6000000006478-2-tps-667-451.png" style="width: 500px;display: block; margin-left: auto; margin-right: auto"/> | <img src="https://img.alicdn.com/imgextra/i4/O1CN01AxXE0V1JqEORoVBdi_!!6000000001079-2-tps-667-451.png" style="width: 500px;display: block; margin-left: auto; margin-right: auto"/> |

| **Observation metrics (training loss, validation loss):** | Training loss increases, validation loss increases |
| --- | --- |
| **Training status:** | **Training failed** |
| **Cause analysis:** | It is highly likely that the learning rate is too high, causing the model parameters to oscillate repeatedly near the optimal solution and fail to find the optimal solution, resulting in training failure.<img src="https://img.alicdn.com/imgextra/i1/O1CN01l4leTB1LKI0BcVs16_!!6000000001280-2-tps-1658-1262.png" style="width: 300px;display: block; margin-left: auto; margin-right: auto"/>|
| **Adjustment logic:** | Significantly reduce the learning rate to $0.00005$, allowing the model to "learn cautiously" with smaller steps. |
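The effect of an oversized learning rate can be reproduced on a toy problem. The sketch below minimises $f(x) = x^2$ with gradient descent; it is not the model's loss surface, and the two learning rates are chosen for this toy function rather than taken from the experiments.

```python
def gradient_descent(lr: float, steps: int = 20, x0: float = 1.0) -> list[float]:
    """Minimise f(x) = x**2 (gradient 2x) and return the loss after every step."""
    x, losses = x0, []
    for _ in range(steps):
        x -= lr * 2 * x
        losses.append(x * x)
    return losses

too_high = gradient_descent(lr=1.1)  # each step overshoots: x becomes -1.2 * x
small = gradient_descent(lr=0.05)    # each step shrinks x: x becomes 0.9 * x

print("loss grows with the large rate:", too_high[-1] > too_high[0])
print("loss shrinks with the small rate:", small[-1] < small[0])
```

With the large rate, every update jumps past the minimum and lands farther away than before, so the loss rises step after step, which matches the rising curves in this experiment.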

#### 3.3.2 Second Experiment (Takes 2 minutes)

<div style="width: 30%">
    
| Parameter | Old parameter value | New parameter value |
| --- | --- | --- |
| Learning rate (learning_rate) | 0.1 | 0.00005 |
    
</div>  



In [11]:
%env CUDA_VISIBLE_DEVICES=0
%env LOG_LEVEL=INFO
!swift sft \
--learning_rate '0.00005' \
--lora_rank 4 \
--num_train_epochs 1 \
--dataset './resources/2_7/train_100.jsonl' \
--batch_size '8' \
--max_length 512 \
--eval_step 1 \
--model_type 'qwen2_5-1_5b-instruct' \
--model_id_or_path './model'

env: CUDA_VISIBLE_DEVICES=0
env: LOG_LEVEL=INFO
run sh: `/usr/local/bin/python /usr/local/lib/python3.10/site-packages/swift/cli/sft.py --learning_rate 0.00005 --lora_rank 4 --num_train_epochs 1 --dataset ./resources/2_7/train_100.jsonl --batch_size 8 --max_length 512 --eval_step 1 --model_type qwen2_5-1_5b-instruct --model_id_or_path ./model`
[INFO:swift] Successfully registered `/usr/local/lib/python3.10/site-packages/swift/llm/data/dataset_info.json`
[INFO:swift] Start time of running main: 2025-07-15 13:40:32.669175
[INFO:swift] Setting template_type: qwen2_5
[INFO:swift] Setting args.lazy_tokenize: False
[INFO:swift] Setting args.dataloader_num_workers: 1
[INFO:swift] output_dir: /mnt/workspace/alibabacloud_acp_learning/ACP/p2_Build LLM Q&A System/output/qwen2_5-1_5b-instruct/v3-20250715-134032
[INFO:swift] args: SftArguments(model_type='qwen2_5-1_5b-instruct', model_id_or_path='/mnt/workspace/alibabacloud_acp_learning/ACP/p2_Build LLM Q&A System/model', model_revision='master', f

| Training loss image | Evaluation loss image |
| --- | --- |
|<img src="https://img.alicdn.com/imgextra/i3/O1CN01DgtNVX1EDgzHYamOE_!!6000000000318-2-tps-680-451.png" style="width: 500px;display: block; margin-left: auto; margin-right: auto"/> | <img src="https://img.alicdn.com/imgextra/i3/O1CN01621v4k1ErzqC24Z1b_!!6000000000406-2-tps-689-451.png" style="width: 500px;display: block; margin-left: auto; margin-right: auto"/> |

| **Observation Metrics (Training Loss, Validation Loss):** | Training loss decreases, validation loss also decreases |
| --- | --- |
| **Training Status:** | **Underfitting** |
| **Cause Analysis:** | Underfitting is a very common phenomenon during training. It indicates that, with the parameters unchanged, simply allowing the model to train longer can lead to successful training. Of course, modifying the parameters can also accelerate the training process. |
| **Adjustment Logic:** | 1. Let the model train longer: Increase the number of dataset learning cycles `epoch` to 50. <br/> 2. Adjust `batch_size` to the maximum value of 16 to speed up model training. |

#### 3.3.3 Third Experiment (Takes 10 minutes)

<div style="width: 50%">

| Parameter | Old Parameter Value | New Parameter Value |
| :--- | :--- | :--- |
| Number of Training Epochs (num_train_epochs) | 1 | 50 |
| batch_size | 8 | 16 |
| eval_step | 1 | 20 (Optimized output display) |

</div> 



In [12]:
%env CUDA_VISIBLE_DEVICES=0
%env LOG_LEVEL=INFO
!swift sft \
--learning_rate '0.00005' \
--lora_rank 4 \
--num_train_epochs 50 \
--dataset './resources/2_7/train_100.jsonl' \
--batch_size '16' \
--max_length 512 \
--eval_step 20 \
--model_type 'qwen2_5-1_5b-instruct' \
--model_id_or_path './model'

env: CUDA_VISIBLE_DEVICES=0
env: LOG_LEVEL=INFO
run sh: `/usr/local/bin/python /usr/local/lib/python3.10/site-packages/swift/cli/sft.py --learning_rate 0.00005 --lora_rank 4 --num_train_epochs 50 --dataset ./resources/2_7/train_100.jsonl --batch_size 16 --max_length 512 --eval_step 20 --model_type qwen2_5-1_5b-instruct --model_id_or_path ./model`
[INFO:swift] Successfully registered `/usr/local/lib/python3.10/site-packages/swift/llm/data/dataset_info.json`
[INFO:swift] Start time of running main: 2025-07-15 13:43:24.447460
[INFO:swift] Setting template_type: qwen2_5
[INFO:swift] Setting args.lazy_tokenize: False
[INFO:swift] Setting args.dataloader_num_workers: 1
[INFO:swift] output_dir: /mnt/workspace/alibabacloud_acp_learning/ACP/p2_Build LLM Q&A System/output/qwen2_5-1_5b-instruct/v4-20250715-134324
[INFO:swift] args: SftArguments(model_type='qwen2_5-1_5b-instruct', model_id_or_path='/mnt/workspace/alibabacloud_acp_learning/ACP/p2_Build LLM Q&A System/model', model_revision='master'

| Training loss image | Evaluation loss image |
| --- | --- |
|<img src="https://img.alicdn.com/imgextra/i4/O1CN01xsw3a31YarKvsEKCR_!!6000000003076-2-tps-671-451.png" style="width: 500px;display: block; margin-left: auto; margin-right: auto"/> | <img src="https://img.alicdn.com/imgextra/i3/O1CN01b2v3fK1jOSNo73Q3y_!!6000000004538-2-tps-680-451.png" style="width: 500px;display: block; margin-left: auto; margin-right: auto"/> |

| **Observation Metrics (Training Loss, Validation Loss):** | Training loss decreases, validation loss first decreases then increases |
| --- | --- |
| **Training Status:** | **Overfitting** |
| **Cause Analysis:** | Overfitting is also a very common phenomenon during training. It indicates that the model is "memorizing questions" rather than learning the knowledge in the dataset. We can reduce the number of epochs or increase the amount of data to make the model "forget the questions." |
| **Adjustment Logic:** | 1. Reduce the number of epochs to 3. <br/> 2. Expand the number of problem solutions generated by DeepSeek-R1 to 1000 entries. Dataset location: current directory/resources/2_7/train_1k.jsonl <br/> 3. After increasing the amount of data, increase the LoRA rank to 8, following the guidance in the LoRA introduction. |

In general, at the scale of today's large language models (LLMs), fine-tuning requires at least **1000** high-quality training entries. Below this threshold, the model tends to "memorize questions" after a few rounds of training instead of learning the knowledge contained in the data.

#### 3.3.4 Fourth Experiment (Expected to take 5 minutes)

| Parameter | Old Value | New Value |
| --- | --- | --- |
| Change Dataset | 100 entries | 1000+ entries |
| Number of Training Epochs (num_train_epochs) | 50 | 3 |
| LoRA Rank (lora_rank) | 4 | 8 (For reasons why this was increased, refer to the LoRA introduction). | 



In [19]:
%env CUDA_VISIBLE_DEVICES=0
%env LOG_LEVEL=INFO
!swift sft \
--learning_rate '0.00005' \
--lora_rank 8 \
--num_train_epochs 3 \
--dataset './resources/2_7/train_1k.jsonl' \
--batch_size '16' \
--max_length 512 \
--eval_step 20 \
--model_type 'qwen2_5-1_5b-instruct' \
--model_id_or_path './model'

env: CUDA_VISIBLE_DEVICES=0
env: LOG_LEVEL=INFO
run sh: `/usr/local/bin/python /usr/local/lib/python3.10/site-packages/swift/cli/sft.py --learning_rate 0.00005 --lora_rank 8 --num_train_epochs 3 --dataset ./resources/2_7/train_1k.jsonl --batch_size 16 --max_length 512 --eval_step 20 --model_type qwen2_5-1_5b-instruct --model_id_or_path ./model`
[INFO:swift] Successfully registered `/usr/local/lib/python3.10/site-packages/swift/llm/data/dataset_info.json`
[INFO:swift] Start time of running main: 2025-07-15 16:24:56.914902
[INFO:swift] Setting template_type: qwen2_5
[INFO:swift] Setting args.lazy_tokenize: False
[INFO:swift] Setting args.dataloader_num_workers: 1
[INFO:swift] output_dir: /mnt/workspace/alibabacloud_acp_learning/ACP/p2_Build LLM Q&A System/output/qwen2_5-1_5b-instruct/v10-20250715-162456
[INFO:swift] args: SftArguments(model_type='qwen2_5-1_5b-instruct', model_id_or_path='/mnt/workspace/alibabacloud_acp_learning/ACP/p2_Build LLM Q&A System/model', model_revision='master',

| Training loss image | Evaluation loss image |
| --- | --- |
|<img src="https://img.alicdn.com/imgextra/i3/O1CN01p8rX0d1UAyUOGHeOJ_!!6000000002478-2-tps-671-451.png" style="width: 500px;display: block; margin-left: auto; margin-right: auto"/> | <img src="https://img.alicdn.com/imgextra/i1/O1CN01LjmbJ21P4Uo8ZJyav_!!6000000001787-2-tps-689-451.png" style="width: 500px;display: block; margin-left: auto; margin-right: auto"/> |


| **Observation Metrics (Training Loss, Validation Loss):** | Training loss decreases, validation loss also decreases |
| --- | --- |
| **Training Status:** | **Underfitting** |
| **Cause Analysis:** | Training is almost successful! |
| **Adjustment Logic:** | Let the model train more: Increase the number of dataset learning iterations (epoch) to 15. |

#### 3.3.5 Fifth Experiment (Takes 20 minutes)

| Parameter | Old Parameter Value | New Parameter Value |
| --- | --- | --- |
| Number of Training Epochs (num_train_epochs) | 3 | 15 |



In [18]:
%env CUDA_VISIBLE_DEVICES=0
%env LOG_LEVEL=INFO
!swift sft \
--learning_rate '0.00005' \
--lora_rank 8 \
--num_train_epochs 15 \
--dataset './resources/2_7/train_1k.jsonl' \
--batch_size '16' \
--max_length 512 \
--eval_step 20 \
--model_type 'qwen2_5-1_5b-instruct' \
--model_id_or_path './model'

env: CUDA_VISIBLE_DEVICES=0
env: LOG_LEVEL=INFO
run sh: `/usr/local/bin/python /usr/local/lib/python3.10/site-packages/swift/cli/sft.py --learning_rate 0.00005 --lora_rank 8 --num_train_epochs 15 --dataset ./resources/2_7/train_1k.jsonl --batch_size 16 --max_length 512 --eval_step 20 --model_type qwen2_5-1_5b-instruct --model_id_or_path ./model`
[INFO:swift] Successfully registered `/usr/local/lib/python3.10/site-packages/swift/llm/data/dataset_info.json`
[INFO:swift] Start time of running main: 2025-07-15 15:08:32.802459
[INFO:swift] Setting template_type: qwen2_5
[INFO:swift] Setting args.lazy_tokenize: False
[INFO:swift] Setting args.dataloader_num_workers: 1
[INFO:swift] output_dir: /mnt/workspace/alibabacloud_acp_learning/ACP/p2_Build LLM Q&A System/output/qwen2_5-1_5b-instruct/v9-20250715-150832
[INFO:swift] args: SftArguments(model_type='qwen2_5-1_5b-instruct', model_id_or_path='/mnt/workspace/alibabacloud_acp_learning/ACP/p2_Build LLM Q&A System/model', model_revision='master',

| Training loss image | Evaluation loss image |
| --- | --- |
|<img src="https://img.alicdn.com/imgextra/i4/O1CN01hyQhbn1p04zyTeQkv_!!6000000005297-2-tps-671-451.png" style="width: 500px;display: block; margin-left: auto; margin-right: auto"/> | <img src="https://img.alicdn.com/imgextra/i3/O1CN01oy2oZv1r0ejEmpYdQ_!!6000000005569-2-tps-680-451.png" style="width: 500px;display: block; margin-left: auto; margin-right: auto"/> |


| **Observation Metrics (Training Loss, Evaluation Loss):** | Training loss has essentially stopped decreasing; evaluation loss has also stopped decreasing and even increases slightly |
| --- | --- |
| **Training Status:** | **Training Successful!** |  



### 3.4 Examination After Fine-Tuning

After fine-tuning, two `checkpoint` files are generally saved: `best_model_checkpoint` (the model parameters that performed best on the validation set) and `last_model_checkpoint` (the model parameters at the completion of the fine-tuning task).

Here, replace the `ckpt_dir` in the code below with the path of the `best_model_checkpoint`, and you will be able to call the fine-tuned model.

First, let's load the model into memory:  



In [22]:
from swift.tuners import Swift

# Please modify ckpt_dir to the correct location before running
ckpt_dir = 'output/qwen2_5-1_5b-instruct/v9-20250715-150832/checkpoint-1035' # Modify to your checkpoint location before running
# Load the model
ft_model = Swift.from_pretrained(model, ckpt_dir, inference_mode=True)

Let's take a look at the performance of the fine-tuned model in the exam.  



In [None]:
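import json
# Latex, Markdown and display are used below to render the results
from IPython.display import Latex, Markdown, display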
sum, score = 0, 0.0
for line in open("./resources/2_7/test.jsonl"):
    # Read math questions from the test set
    math_question = json.loads(line)
    query = math_question["messages"][1]["content"]
    # Use the fine-tuned model for inference
    response, _ = inference(ft_model, template, query)
    # Get the correct answer
    ans = math_question["messages"][2]["content"]
    pos = ans.find("ans")
    end_pos = ans[pos:].find('}}')
    ans = ans[pos - 2: end_pos + pos + 2]
    # Organize output
    print(("========================================================================================"))
    print(query.split("#Math Problem#\n")[1])
    print("The answer to the question is: " + ans)
    print("-----------Model Response----------------")
    display(Latex(response))
    print("-----------End of Response----------------")
    # Calculate the model's score
    if ans in response:
        score += 1
        print("The model answered correctly")
    elif ans[6 : -2] in response:
        score += 0.5
        print("The model answered correctly but the output format was incorrect")
    else: print("The model answered incorrectly")
    sum += 1
# Summary
display(Markdown("The fine-tuned model scored **" + str(int(100*score/sum)) + "** points on the exam"))

### 3.5 Parameter Matrix Fusion

After the model training is completed, there are two ways to use the trained model:

1. Dynamically load the fine-tuned model at the time of invocation.

   The low-rank parameter matrix obtained after fine-tuning occupies only 20MB of storage space, which is very convenient for incremental deployment and distribution. This is a commonly used method in engineering. It should be noted that whichever base model is used for fine-tuning, the corresponding base model must be specified upon loading.

   In the previous subsection, we have already tried this method by specifying `ckpt_dir`.

2. Merge the base model with the low-rank parameters obtained from fine-tuning to create a complete model with updated parameters, and then invoke the merged model.

Here, we introduce the second method: merging the "fine-tuned parameter matrix" with the "base model parameter matrix" to store the fine-tuned model parameters as a complete parameter matrix.

By running the `swift export` command with the path of the fine-tuned model (the `best_model_checkpoint` is recommended), you can obtain the merged model.

In [None]:
%env LOG_LEVEL=INFO
!swift export \
--ckpt_dir 'output/qwen2_5-1_5b-instruct/vx-xxx/checkpoint-xxx<Modify to checkpoint location before running>' \
--merge_lora true

The log displays the path of the model after fusion. The complete parameter matrix after fusion is stored by default in the `checkpoint` directory. (The complete model parameters for the PAI experimental environment are located at: `output/qwen2_5-1_5b-instruct/vX-XXX/checkpoint-XX-merged`).  



## ✅ Summary

In this section, we have learned the following:

* Understanding the core value of model fine-tuning: directly improving the model's reasoning ability in mathematics through targeted data injection, overcoming the limitations of prompt engineering and RAG chatbot.

* Mastering key training parameters: learning rate controls the magnitude of parameter updates, epoch determines the number of data traversals, batch size affects gradient stability, and the loss function monitors the training status.

* Understanding the principle of LoRA efficient fine-tuning: reducing memory consumption based on low-rank matrix decomposition (theoretical explanation), and optimizing training effects by adjusting the lora_rank parameter in practice.

* Completing iterative hyperparameter tuning experiments: solving underfitting and overfitting problems through multiple adjustments of learning rate/data volume/training rounds, ultimately significantly improving the model's problem-solving accuracy.

Although this tutorial lets you fine-tune with pre-prepared datasets and free GPU computing resources, **in actual production, fine-tuning is not simple and requires weighing factors such as computing power costs, data scale, and data quality**.
In particular, consider the following:
1. Whether low-cost solutions such as prompt engineering and RAG are sufficient to solve the problem.
2. Whether the amount and quality of data meet the minimum threshold for fine-tuning (at least 1000 high-quality samples).
3. Whether the project budget and the team's technical expertise are adequate, and whether the cost-effectiveness is acceptable.

### Further Learning
#### Fine-tuning for More Machine Learning Tasks

* Image classification (e.g., object recognition, medical image diagnosis)
    * Fine-tuning purpose: Optimize feature extraction capabilities for specific image datasets based on pre-trained models (e.g., ResNet, ViT).
    * Key points: Reduce data requirements and leverage the general visual knowledge of pre-trained models to transfer to niche tasks.

* Object detection (e.g., autonomous driving, security monitoring)
    * Fine-tuning purpose: Adjust the model (e.g., YOLO, Faster R-CNN) for detecting specific objects or scenes.
    * Key points: Optimize the model's sensitivity to target location and category, reducing false positives/missed detections.

* Machine translation (e.g., domain-specific translation)
    * Fine-tuning purpose: Adapt a general translation model (e.g., mBART, T5) to professional terminology and expression habits.
    * Key points: Address semantic bias issues in vertical domain translations for general models.

* Recommendation systems (e.g., e-commerce, content platforms)
    * Fine-tuning purpose: Optimize recommendation models (e.g., collaborative filtering, deep ranking models) based on user behavior data.
    * Key points: Balance personalized recommendations with cold-start problems, improving click-through rates/conversion rates.

#### More Efficient Fine-tuning Methods

* **Freeze**: This method was one of the earliest PEFT methods. It freezes most of the model's parameters during fine-tuning, training only a small portion of the model’s parameters (e.g., the last few neural network layers) to quickly adapt to specific task needs. Characteristics:
    * High parameter efficiency (only a small number of parameters are trained).
    * Suitable for scenarios where the task is close to the pre-training objective (e.g., text classification).
    * May not perform well for complex tasks.
<div style="text-align: left;">
<img src="https://img.alicdn.com/imgextra/i1/O1CN01X9GOk81sgAEtxflGR_!!6000000005795-2-tps-1340-686.png" style="width: 600px;display: block; margin-left: 60px; margin-right: auto"/>
</div>

* **Adapter Tuning**: Adapter layers are inserted at certain positions in the original model architecture. During fine-tuning, the model's original parameters are frozen and only the Adapter layers are trained. Characteristics:
    * Modular design with strong compatibility.
    * Slightly higher parameter count than LoRA but stable performance.
    * Requires modifying the model structure, with additional computation required during inference.
<div style="text-align: left;">
<img src="https://img.alicdn.com/imgextra/i2/O1CN016gccCd1CdDpjDxbe9_!!6000000000103-2-tps-1482-1048.png" style="width: 500px;display: block; margin-left: 60px; margin-right: auto"/>
</div>

* **Prompt Tuning**: Model behavior is controlled indirectly by optimizing learnable vectors (prompts) at the input while the model parameters stay frozen. Characteristics:
    * No need to modify the model structure; only adjust the input.
    * Friendly to generative tasks (e.g., translation, dialogue).
    * Effect depends on prompt design; may be insufficient for complex tasks.
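
As a rough comparison of the methods above with LoRA, the following sketch counts trainable parameters using the usual textbook formulas. The hidden size, rank, bottleneck width, and number of virtual tokens are hypothetical values chosen for illustration. The function names are not from any library. The counts are per weight matrix or per inserted module, so they indicate orders of magnitude rather than totals for a whole model. Freeze is omitted because its count depends entirely on which layers are left trainable.

```python
def full_finetune_params(d_in, d_out):
    """All entries of a d_out x d_in weight matrix are trainable."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """LoRA trains B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_in + d_out)

def adapter_params(d_model, bottleneck):
    """Bottleneck adapter: down- and up-projection weights plus their biases."""
    return 2 * d_model * bottleneck + bottleneck + d_model

def prompt_tuning_params(num_virtual_tokens, d_model):
    """One learnable embedding vector per virtual token."""
    return num_virtual_tokens * d_model

d = 1536  # hypothetical hidden size, for illustration only
counts = {
    "full (one d x d matrix)": full_finetune_params(d, d),
    "lora (rank 8)": lora_params(d, d, rank=8),
    "adapter (bottleneck 8)": adapter_params(d, bottleneck=8),
    "prompt tuning (20 tokens)": prompt_tuning_params(20, d),
}
for name, n in counts.items():
    print(f"{name}: {n:,}")
```

With an equal rank and bottleneck width, the adapter has slightly more parameters than LoRA because of its bias terms, which matches the characteristic listed above.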

#### Fine-tuning Dataset Construction Strategy

Generally speaking, fine-tuning for more complex scenarios requires at least **1000+ high-quality training samples**. When constructing the dataset, confirm the following points:

* **Data Quality**: Ensure the dataset is accurate, relevant, and remove ambiguous or incorrect samples.
* **Diversity Coverage**: Cover the full range of scenarios, contexts, and professional terminology for the task, so that the data does not follow a single narrow distribution.
* **Class Balance**: If the task involves multiple class scenarios, ensure balanced samples across classes to prevent model bias towards one class.
* **Continuous Iteration**: Fine-tuning is an iterative process; continuously optimize and expand the dataset based on feedback from the model's performance on the validation set.
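
Some of these checks can be automated before training. The sketch below is a minimal example under stated assumptions: each sample is a dict with hypothetical `text` and `label` keys, "quality" is reduced to dropping empty and exactly duplicated texts, and the default `min_samples=1000` simply mirrors the threshold mentioned above. Real pipelines usually need task-specific checks on top of this.

```python
from collections import Counter

def summarize_dataset(samples, min_samples=1000):
    """Deduplicate samples and report size and class balance.

    `samples` is a list of dicts with hypothetical keys 'text' and 'label'.
    """
    seen = set()
    cleaned = []
    for sample in samples:
        text = sample["text"].strip()
        if not text or text in seen:
            continue  # drop empty texts and exact duplicates
        seen.add(text)
        cleaned.append({"text": text, "label": sample["label"]})

    label_counts = Counter(s["label"] for s in cleaned)
    # Ratio between the largest and smallest class; 1.0 means perfectly balanced.
    imbalance = max(label_counts.values()) / min(label_counts.values()) if label_counts else 0.0
    return {
        "num_samples": len(cleaned),
        "label_counts": dict(label_counts),
        "imbalance_ratio": imbalance,
        "meets_minimum": len(cleaned) >= min_samples,
    }

samples = [
    {"text": "3 + 4 * 2 = ?", "label": "arithmetic"},
    {"text": "3 + 4 * 2 = ?", "label": "arithmetic"},       # exact duplicate
    {"text": "   ", "label": "arithmetic"},                 # empty after stripping
    {"text": "Convert 2 km to m.", "label": "unit_conversion"},
    {"text": "12 / 4 + 1 = ?", "label": "arithmetic"},
]
report = summarize_dataset(samples)
print(report)
```

A high `imbalance_ratio` or a failed `meets_minimum` check is a signal to collect or generate more data for the under-represented cases before fine-tuning.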

If you lack data when fine-tuning a model, it is recommended to enhance the model's capabilities using knowledge base retrieval (e.g., business documents, FAQs).

> In many complex business scenarios, a combined approach of model optimization and knowledge base retrieval can be adopted.

You can also use the following strategies to expand the dataset:

* **Manual Annotation**: Have experts extend the data for typical scenarios.
* **Model Generation**: Simulate business scenario data using LLMs.
* **External Collection**: Obtain data through web scraping, public datasets, user feedback, etc.

#### Common Evaluation Metrics for Models

Evaluation metrics differ significantly for different types of tasks. Below are some typical task evaluation metrics:

* **Classification Tasks**:
    * Accuracy: The proportion of correct predictions.
    * Precision, Recall, and F1 Score: Measure how well the positive class is identified in binary or multi-class classification problems.

* **Text Generation Tasks**:
    * BLEU (Bilingual Evaluation Understudy): Mainly used in natural language processing tasks such as machine translation, calculating scores by comparing n-gram overlaps between candidate translations and one or more reference translations.
    * ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Commonly used for automatic summary evaluation, based on n-gram recall, precision, and F-measure.
    * Perplexity: Measures the uncertainty of a probability distribution model predicting a sample; lower is better.

* **Image Recognition/Object Detection**:
    * Intersection over Union (IoU): The ratio of the intersection area to the union area of two bounding boxes.
    * mAP (mean Average Precision): Widely used in object detection tasks.
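
Several of these metrics are simple enough to compute by hand. The following sketch implements precision/recall/F1, perplexity, and IoU in plain Python using their standard definitions. The function names and the toy inputs are illustrative assumptions. In practice you would normally rely on an established evaluation library, and BLEU, ROUGE, and mAP are omitted because they involve more bookkeeping than fits here.

```python
import math

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Toy inputs: one true positive, one false positive, one false negative.
p, r, f1 = precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0])
# A model that assigns probability 0.5 to every token.
ppl = perplexity([math.log(0.5)] * 4)
# Two 2x2 boxes that overlap in a 1x1 square.
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
print(p, r, f1, ppl, overlap)
```

The perplexity example shows the usual intuition: a model that is as uncertain as a fair coin flip on every token has a perplexity of about 2.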

## 🔥 Post-Class Quiz

### 🔍 Single-Choice Question
<details>
<summary style="cursor: pointer; padding: 12px; border: 1px solid #dee2e6; border-radius: 6px;">
<b>Which of the following statements about LoRA is incorrect ❓</b>

- A. LoRA can effectively reduce the cost of fine-tuning large language models.
- B. LoRA modifies the original weights of the fine-tuned model.
- C. LoRA's implementation is relatively simple and easy to integrate.
- D. The results of LoRA fine-tuning can be easily reverted.

**[Click to View Answer]**
</summary>

<div style="margin-top: 10px; padding: 15px; border: 1px solid #dee2e6; border-radius: 0 0 6px 6px;">

✅ **Reference Answer: B**  
📝 **Explanation**:  
- LoRA does not directly modify the original weights but indirectly affects model behavior by adding low-rank matrices. This makes rollback operations simple, as you only need to remove the added low-rank matrices.

</div>
</details>

---


### 🔍 Multiple-Choice Question
<details>
<summary style="cursor: pointer; padding: 12px; border: 1px solid #dee2e6; border-radius: 6px;">
<b>You are using Swift to fine-tune a Qwen model and notice a significant upward trend in loss on the validation set. Which of the following actions can help alleviate or resolve this issue ❓</b>

- A. Increase learning rate
- B. Decrease learning rate
- C. Increase --num_train_epochs
- D. Decrease --num_train_epochs

**[Click to View Answer]**
</summary>

<div style="margin-top: 10px; padding: 15px;  border: 1px solid #dee2e6; border-radius: 0 0 6px 6px;">

✅ **Reference Answer: BD**  
📝 **Explanation**:  
- learning_rate: A high learning rate speeds up training but can cause oscillation around the optimal solution, or even non-convergence. The resulting fluctuating loss may look like overfitting, although it is different from true overfitting. Decreasing the learning rate alleviates this.  
- num_train_epochs: Overfitting may also be caused by too many training epochs. Reducing the number of training epochs can prevent the model from over-learning the training data.

</div>
</details>  

