# To All Reviewers

We appreciate your thorough evaluation and helpful suggestions and comments. In our response to each of you, we have provided point-by-point responses to your specific comments. We hope our response addresses all the concerns raised in your review. Additionally, we would like to clarify the main contributions of this study, which can be summarized as follows:

- To the best of our knowledge, this is the first study to highlight the importance of calibrating both the multiclass classifier and the OOD detector in safe SSL under label distribution mismatch. Through extensive experiments, we investigated the importance of calibration, its specific role in SSL under label distribution mismatch, and its effectiveness for safe SSL.
- We propose two well-calibrated scores, a confidence score and an OOD score, and build a new safe SSL method on them. The two scores filter out unseen-class samples from the unlabeled training data and improve the quality of the pseudo-labels used in SSL.
- The two proposed scores are shown to be effective in safe SSL, achieving state-of-the-art performance in all eight SSL scenarios with OOD data present.


# To Reviewer x6WL’s Comments

> Weaknesses.1) While the final experimental results are quite good, the innovation of using calibration techniques to calibrate the OOD detector and model is limited, and the inspiration it provides to the field is also limited.

Response) To the best of our knowledge, our work is the first to demonstrate the importance of calibration for both the multiclass classifier and the OOD detector in the context of SSL under label distribution mismatch. We proposed adaptive label smoothing with temperature scaling and applied it to both the multiclass classifier and the OOD detector. Through extensive experiments, including ablation studies, on various image classification benchmark datasets, we thoroughly investigated the importance of calibration, its specific role in SSL under label distribution mismatch, and its effectiveness for safe SSL. We believe these results and discussions are a unique contribution of our work and can draw the machine learning community's attention to the importance of calibration in safe SSL research.

> Weaknesses.2) Based on the observations from Figure 3, it is evident that $\tau_1$ to some extent influences the method's effectiveness. Selecting an appropriate for different datasets appears to be crucial. I am uncertain whether a fixed $\tau_1$ can ensure optimal performance across diverse datasets. It might be beneficial to conduct experiments on additional datasets or consider an adaptive adjustment mechanism for $\tau_1$.

Response) We agree that the choice of $\tau_{1}$ can be crucial for the effectiveness of our method. We therefore conducted a sensitivity analysis on CIFAR-10 with the 60\% mismatch ratio. As the reviewer anticipated, the downstream task accuracy depends on the choice of $\tau_{1}$, but in all cases CaliMatch outperformed all other safe SSL methods, including OpenMatch. We have also developed a data-driven approach that determines $\tau_1$ by averaging CaliMatch's seen-class scores $s_i$ on the in-distribution validation samples and updating $\tau_1$ every epoch during training. We will further investigate how well this approach determines $\tau_{1}$ on other datasets and will include the results in the revised manuscript.

> Weaknesses.3) Although the feasibility of the method has been analyzed from an empirical perspective, incorporating a theoretical analysis could make this work more solid with more insights and inspiration for researchers in the field.

Response) We carried out a preliminary theoretical analysis based on [1,2] to give insight into why better-calibrated models lead to better downstream task performance.

Let us define labeled seen-class data as $D\_{\ell'}=\{(x_{i}^{u},y_{i}^{u})\in D_{u}\times\mathcal{Y}\}$ sampled from $D_u$. Minimizing the difference between the two FixMatch losses for $D\_{\ell'}$ and $D_u$ can be regarded as debiasing the empirical risk for SSL, following Theorem D.1 of [2], although this has not yet been proven under the label distribution mismatch assumption. The difference is $\mathcal{L'}\_{\text{Fix}}(D\_{\ell'};\mathcal{T}\_s)-\mathcal{L'}\_{\text{Fix}}(D\_u;\mathcal{T}\_w,\mathcal{T}\_s)$, and the two FixMatch losses are defined as follows:
$$
    \mathcal{L'}\_{\text{Fix}}(D\_{\ell'};\mathcal{T}\_s)=\sum\_{i=1}^{|D\_{\ell'}|}\sum\_{k=1}^K-\mathbb{I}(y\_i^u=k)\log p\_{k}(\mathcal{T}\_{s}(x\_{i}^{u}))
$$
$$
\mathcal{L'}\_{\text{Fix}}(D\_u;\mathcal{T}\_w,\mathcal{T}\_s)=\sum\_{i=1}^{|D\_u|}\mathbb{I}(c(x\_i^u)>\tau\_2 )\sum\_{k=1}^K -\mathbb{I}(\text{argmax}\_{l} p\_{l}(\mathcal{T}\_{w}(x\_{i}^{u}))=k)\log p\_{k}(\mathcal{T}\_{s}(x\_{i}^{u})),
$$
where $c(x_i^u)$ represents the classification confidence $\text{max}\_{k\in\mathcal{Y}}p\_{k}(\mathcal{T}\_{w}(x\_{i}^{u}))$. Additionally, letting $D_{\text{OOD}}$ denote the unlabeled OOD data within $D_u$, the difference is bounded by two terms via the triangle inequality:
$$
|\mathcal{L'}\_{\text{Fix}}(D\_{\ell'};\mathcal{T}\_s)-\mathcal{L'}\_{\text{Fix}}(D\_u;\mathcal{T}\_w,\mathcal{T}\_s)|\leq|\mathcal{L'}\_{\text{Fix}}(D\_{\ell'};\mathcal{T}\_s)-\mathcal{L'}\_{\text{Fix}}(D\_u\setminus D\_{\text{OOD}};\mathcal{T}\_w,\mathcal{T}\_s)| + |\mathcal{L'}\_{\text{Fix}}(D\_{\text{OOD}};\mathcal{T}\_w,\mathcal{T}\_s)|.
$$
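For completeness, the step behind this bound can be written out. It relies only on the fact that $\mathcal{L'}\_{\text{Fix}}$ is a sum over samples, so it splits additively over the disjoint subsets $D_u\setminus D_{\text{OOD}}$ and $D_{\text{OOD}}$ of $D_u$. The shorthand $A$, $B$, $C$ is introduced only for this sketch:
$$
A=\mathcal{L'}\_{\text{Fix}}(D\_{\ell'};\mathcal{T}\_s),\quad B=\mathcal{L'}\_{\text{Fix}}(D\_u\setminus D\_{\text{OOD}};\mathcal{T}\_w,\mathcal{T}\_s),\quad C=\mathcal{L'}\_{\text{Fix}}(D\_{\text{OOD}};\mathcal{T}\_w,\mathcal{T}\_s),
$$
$$
\mathcal{L'}\_{\text{Fix}}(D\_u;\mathcal{T}\_w,\mathcal{T}\_s)=B+C\quad\Longrightarrow\quad|A-(B+C)|=|(A-B)-C|\leq|A-B|+|C|.
$$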
If we prove that reducing the two bound terms of the difference minimizes the overall difference, we can establish the following research hypotheses, which would be valuable for future work:
- Ensuring high-quality pseudo-labels, by keeping only the unlabeled seen-class data in $D_u\setminus D_{\text{OOD}}$ whose well-calibrated confidence $c(x_i^u)$ exceeds $\tau_2$, reduces the first bound term of the difference, thereby debiasing the empirical risk for safe SSL.
- Excluding the unlabeled unseen-class data $D_{\text{OOD}}$ from $\mathcal{L'}\_{\text{Fix}}$, based on a well-calibrated OOD score, reduces the second bound term of the difference, thereby debiasing the empirical risk for safe SSL.
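
To make the two hypotheses concrete, the following minimal Python sketch shows how both filters could enter the unlabeled loss. It is an illustration under stated assumptions, not our training code: the function name `masked_fixmatch_loss`, the list-based inputs, and the rule of keeping a sample only when its seen-class score exceeds $\tau_1$ and its confidence exceeds $\tau_2$ are simplifications chosen for this example.

```python
import math


def masked_fixmatch_loss(weak_probs, strong_probs, seen_scores, tau1=0.5, tau2=0.95):
    """Sum of pseudo-label cross-entropies over the selected unlabeled samples.

    weak_probs / strong_probs: per-sample class-probability lists for the
    weakly and strongly augmented views (hypothetical inputs).
    seen_scores: per-sample seen-class scores from the OOD detector.
    Assumption: a sample is used only if seen score > tau1 and confidence > tau2.
    """
    total = 0.0
    n_selected = 0
    for p_w, p_s, s in zip(weak_probs, strong_probs, seen_scores):
        confidence = max(p_w)                 # c(x): top probability on the weak view
        if confidence > tau2 and s > tau1:    # confidence filter and OOD filter
            pseudo = p_w.index(confidence)    # argmax pseudo-label from the weak view
            total += -math.log(p_s[pseudo])   # cross-entropy on the strong view
            n_selected += 1
    return total, n_selected


# Hypothetical values for illustration only, not experimental data.
weak = [[0.97, 0.03], [0.60, 0.40], [0.99, 0.01]]
strong = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
seen = [0.9, 0.9, 0.2]
loss, kept = masked_fixmatch_loss(weak, strong, seen)
```

In this toy input, the second sample is dropped by the confidence filter and the third by the OOD filter, so only the first sample contributes to the loss. If either score is overconfident, more samples pass the corresponding filter, which is why calibration matters for both terms.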

> Question.1) Given that OOD methods and calibration methods have been previously explored, in what ways does the proposed framework distinguish itself or advance beyond existing approaches in terms of innovation or application?

Response) Please refer to our response to Weakness 1 above.

> Question.2) Does CaliMatch have the capability to detect outliers that closely resemble inliers, addressing a common challenge in outlier detection?

Response) When outliers have similar visual characteristics to those of inliers, detecting such outliers is challenging. However, Azizmalayeri et al. [3] suggested that measuring overconfidence in OOD detection scores can help detect outliers created by adding Gaussian noise to inliers. Combined with their method, CaliMatch may have better OOD detection capabilities when it comes to samples that closely resemble inliers.

- [1] Du, P., Zhao, S., Sheng, Z., Li, C., & Chen, H. (2023). Semi-Supervised Learning via Weight-aware Distillation under Class Distribution Mismatch. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 16410-16420).
- [2] Schmutz, H., Humbert, O., & Mattei, P.-A. (2022). Don’t fear the unlabelled: safe semi-supervised learning via debiasing. In The Eleventh International Conference on Learning Representations.
- [3] Azizmalayeri, M., Abu-Hanna, A., & Cinà, G. (2024). Mitigating Overconfidence in Out-of-Distribution Detection by Capturing Extreme Activations. In Conference on Uncertainty in Artificial Intelligence (UAI).

# To Reviewer oWkk’s Comments

> Weaknesses.1) Combinations of loss functions. It will limit applicability of this method to various settings, since the sweet spot will vary depending on the downstream task.

**Response**) As pointed out, CaliMatch is trained with a combination of loss functions, and we are aware that deep neural networks trained this way may fail badly on some tasks when the balance between the losses is not set appropriately. We therefore performed a sensitivity analysis of $\lambda_{\text{O}}$ and $\lambda_{\text{OCal}}$ (please refer to the table below). The downstream task performance depends on the choice of $\lambda_{\text{O}}$ and $\lambda_{\text{OCal}}$, but we did not observe a significant failure of our method anywhere in the range of hyperparameters we considered. This suggests that, although a naive choice of hyperparameters does not guarantee optimal performance, our method is robust enough not to fail significantly in most cases. Moreover, CaliMatch outperformed the other methods for every hyperparameter combination we tested.

---

> Weaknesses.2) too many hyperparameters to control.

**Response**) We acknowledge this limitation, and note that existing safe SSL methods share it: they also have many hyperparameters because they consist of an OOD detector and a multiclass classifier. It is therefore not unique to our method but a common limitation that has not yet been addressed. Moreover, although our method has many hyperparameters, some of them can readily be chosen and work well in general. For example, in most cases $\tau_{1}$ and $\tau_{2}$ can be fixed at 0.5 and 0.95, the same as in OpenMatch, and we confirmed that with this choice our method outperforms the existing safe SSL methods. In addition, we performed a sensitivity analysis on CIFAR-10 with the 60\% mismatch ratio for $\lambda_{\text{O}}$ and $\lambda_{\text{OCal}}$ and summarized the results in the table below. Although different settings of $\lambda_{\text{O}}$ and $\lambda_{\text{OCal}}$ yielded different performance, in all cases CaliMatch outperformed all other safe SSL methods in terms of downstream task accuracy. That is, even without careful tuning of hyperparameters, our method can achieve better performance than the existing methods.

|$\lambda_{\text{O}}$|$\lambda_{\text{OCal}}$|Accuracy in Classification|ECE in Classification|F1 in OOD Detection|ECE in OOD Detection|
|:-:|:-:|:-:|:-:|:-:|:-:|
|0.1|0.1|87.62 (0.36)|0.029 (0.003)|0.883 (0.003)|0.064 (0.005)|
|0.1|0.05|87.94 (0.57)|0.037 (0.010)|0.881 (0.010)|0.066 (0.013)|
|0.1|0.01|87.74 (0.57)|0.041 (0.012)|0.873 (0.006)|0.044 (0.006)|
|0.5|0.1|87.51 (0.33)|0.037 (0.013)|0.875 (0.004)|0.074 (0.014)|
|1|0.1|86.86 (0.22)|0.039 (0.011)|0.872 (0.006)|0.062 (0.011)|

---

> Weaknesses.3) Novelty is limited. There were calibration studies and many SSL methods. Also, there were many robust SSL studies, including when the label set of detected samples and the true label set of unlabeled samples are different, e.g. OpenMatch. What is the strength of this paper over those studies? I saw experiment results and I want to know which factors exactly made those differences and why it should. At least, there should be experiment result showing which factor does what (not just test acc).

**Response**) As pointed out by Reviewers x6WL and Tr6T, the contribution of our work is to demonstrate the importance of calibration in safe SSL, which has not been explored yet. Even if our methodological contribution is incremental, we believe that our experimental results and discussions on the importance of calibration in safe SSL are valuable to the machine learning community. Moreover, we applied different methods for improving calibration and compared them with the proposed temperature scaling approach (please refer to the table in our response to Question 3 of Reviewer dKTq). We found that our approach is more effective than popular calibration approaches when applied to the safe SSL task.

# To Reviewer dKTq’s Comments

> Weaknesses.1) From the methods section of the paper, it is evident that this work builds upon the existing framework of OpenMatch by integrating a novel calibration loss. Thus CaliMatch does not achieve further methodological innovations.

**Response**) Please refer to our response to W3 of reviewer oWkk.

---

> Question.1) The paper introduces two techniques without delving deeply into theoretical analysis, relying instead on empirical validation. This approach may lack sufficient persuasive power regarding the effectiveness of the methods.

**Response**) We plan to investigate theoretically why our two calibrated scores $c_i^u$ and $s_i^u$ improve pseudo-label quality and effectively reject the unlabeled OOD samples in safe SSL. Please refer to our response to W3 of Reviewer x6WL.

---

> Question.2) The method proposed in this paper validates using a labeled data validation set. However, the issue of model overconfidence typically arises from unlabeled data. If this issue could be effectively mitigated solely through labeled data, many existing techniques would similarly demonstrate effectiveness. Labeled data alone does not appear to address the confirmation bias caused by model self-training, which remains a longstanding challenge in the field.

**Response**) We consider a labeled validation dataset the most direct way of estimating the model's current accuracy, which we do through $\Gamma$ and $\Delta$. We then align the model's confidence with this estimated accuracy through our adaptive label smoothing with scaling factors $T_M$ and $T_O$, which counteracts overconfidence. Compared to other existing calibration methods, our adaptive label smoothing with temperature scaling achieved the best results in improving the efficacy of safe SSL. For more details, please refer to our response to your third question. We also agree with the importance of using unlabeled data for calibration and intend to extend our technique in this direction in the future, as you suggest.

---

> Question.3) The label smoothing method employed in this paper is a classic technique in the field of model calibration. However, the paper seems to lack in-depth discussion on model calibration, especially in comparison with other methods like mixup.

**Response**) Existing calibration methods, such as label smoothing and mix-up, do not consider the model's current accuracy when determining the degree of calibration needed. Consequently, they often yield suboptimal calibration improvements. In contrast, our $\Gamma$ and $\Delta$ use a labeled validation set to estimate the current accuracy and help determine the necessary level of label smoothing. We then align the model's confidence distribution with this accuracy through adaptively smoothed labels with $T_M$ and $T_O$. The two learnable factors optimize themselves to stabilize the calibration process as the models learn the adaptively smoothed labels. This stability should also be valuable when the technique is applied to other frameworks.
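
As a small illustration of the role a scaling factor plays, the sketch below applies plain temperature scaling to a vector of logits. It is only a sketch under assumptions: the temperature is fixed by hand rather than learned as $T_M$ and $T_O$ are, the function name and logits are hypothetical, and the adaptive label smoothing itself is not reproduced here.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                              # subtract the max before exp
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 1.0, 0.0]  # hypothetical logits, not taken from any experiment
sharp = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=2.0)
# A temperature above 1 lowers the top confidence without changing the
# predicted class: max(soft) < max(sharp), and the argmax is the same.
```

Because dividing by a positive temperature preserves the ordering of the logits, accuracy is unaffected and only the confidence level changes, which is what makes a scaling factor a natural calibration knob.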

To support our claims, we present additional experimental results on CIFAR-10 with the 60\% mismatch ratio to identify the most effective calibration method for safe SSL. First, we applied classic label smoothing and mix-up to both the multiclass classifier and the OOD detector, and compared their results with those of CaliMatch. The calibration method in CaliMatch outperformed the existing methods, achieving the best classification accuracy and ECE, which makes it the most effective calibration component for safe SSL among those compared. Mix-up also improved the efficacy of OpenMatch, yielding better accuracy and ECE, but its improvements were smaller than those of our adaptive label smoothing with two scaling factors. When classic label smoothing was applied to both the classifier and the OOD detector of OpenMatch, the OvR binary classifiers in the OOD detector became unstable due to gradient explosion, failing to sustain SSL training. This result highlights the usefulness of our learnable parameter $T_M$ in OOD calibration. When we applied classic label smoothing only to the multiclass classifier, OpenMatch showed some improvement in multiclass classification, but it was still suboptimal compared to our calibration method. Thank you for your valuable suggestion; we will include this discussion in the revised manuscript.

| Method | Calibrating Classification | Calibrating OOD Detection | Accuracy in Classification | ECE in Classification | F1 in OOD Detection | ECE in OOD Detection |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| OpenMatch | X | X | 86.19 (0.74) | 0.115 (0.007) | 0.881 (0.013) | 0.126 (0.006) |
| OpenMatch with Label Smoothing | O | X | 86.84 (0.40) | 0.082 (0.004) | 0.854 (0.008) | 0.121 (0.008) |
| OpenMatch with Label Smoothing | O | O | 83.49 (0.86) | 0.074 (0.017) | **0.887 (0.015)** | 0.070 (0.014) |
| OpenMatch with Mix-up | O | O | 86.57 (0.98) | 0.060 (0.025) | 0.878 (0.024) | 0.065 (0.025) |
| **CaliMatch** | O | O | **87.62 (0.36)** | **0.029 (0.003)** | 0.883 (0.003) | **0.064 (0.005)** |
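
For readers less familiar with the ECE columns, the following sketch shows the standard binned definition of expected calibration error. The function name, the equal-width binning, and the default of 10 bins are assumptions made for this illustration and do not restate the evaluation settings behind the table.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - average confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Bin b covers [b / n_bins, (b + 1) / n_bins); confidence 1.0 goes to the last bin.
        idx = min(n_bins - 1, int(conf * n_bins))
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical predictions for illustration only: 1 marks a correct prediction.
example_ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 1, 1, 0])
```

A lower value means that the reported confidence tracks the observed accuracy more closely, which is the property both thresholds in safe SSL rely on.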

# To Reviewer Tr6T’s Comments

> Weaknesses.1) The authors here shed light on an important problem but methodologically the additions relative to OpenMatch are limited (CaliMatch is OpenMatch with added calibration of OOD detector and classifier) and the calibration method used is the standard temperature scaling technique optimised with a calibration loss during training, not novel per se. However, I still think that the depth of the experiments and the thorough analysis of the impact of adding calibration to existing safe semi-supervised baseline provides valuable new insights and highlights an important problem.

Response) We appreciate your positive comments, thorough reviews, and recognition of our contributions. To emphasize the novelty and effectiveness of our calibration method in CaliMatch, we have detailed the differences between our adaptive label smoothing with temperature scaling and other existing calibration methods in our response to reviewer dKTq's questions. In that response, we have presented additional experimental results to support our claim. We would be grateful if you could check that response as well.

> Weaknesses.2) It would have been nice to have results on a bigger dataset, with more realistic tasks (e.g. full ImageNet, one dataset of the WILDS benchmark or a medical dataset).

Response) We appreciate your interest in seeing our results on larger datasets with more realistic tasks. In our future research, especially in the context of distribution shift (in addition to label shift), we will follow your suggestion and use the WILDS benchmark to evaluate the robustness and practical applicability of our proposed method.

# Don't delete

Our proposed framework distinguishes itself from and advances beyond existing approaches in the following key ways:
- To the best of our knowledge, we are the first to introduce a safe SSL framework that emphasizes the importance of calibration for both the multiclass classifier and the OOD detector.
- We propose two well-calibrated scores, a confidence score and an OOD score, obtained through our adaptive label smoothing with temperature scaling for safe SSL.
- Our framework's effectiveness is validated through extensive experiments on various image benchmark datasets, achieving state-of-the-art performance among existing safe SSL methods.
- Our work can provide valuable insights and practical solutions for the SSL research community, making a solid contribution to safe SSL methods.


Response) We agree that the choice of $\tau_{1}$ can be crucial for the effectiveness of our method. We therefore conducted a sensitivity analysis on CIFAR-10 with the 60\% mismatch ratio. First, as the reviewer anticipated, the downstream task accuracy depends on the choice of $\tau_{1}$, but in all cases CaliMatch outperformed OpenMatch. Second, following your suggestion, we estimated $\tau_1$ in a data-driven manner using the validation set and confirmed its efficacy. We compute $\tau_1$ as the average of CaliMatch's seen-class scores $s_i$ on the in-distribution validation samples and iteratively update $\tau_1$ every epoch during training using a weighted moving average with $\alpha = 0.99$ for stability. The experimental results below indicate that this adjustment mechanism has the potential to guide practitioners in selecting an appropriate $\tau_1$, given its decent downstream task accuracy compared to OpenMatch. We appreciate your insight, and if time permits, we will investigate this guideline further on other datasets.

|$\tau_1$|CaliMatch|OpenMatch|
|:-:|:-:|:-:|
|0.5|87.62 (0.41)|86.22 (0.81)|
|0.6|87.62 (0.31)|86.29 (0.77)|
|0.7|88.13 (0.16)|86.22 (0.67)|
|0.8|87.72 (0.45)|86.37 (0.67)|
|Adaptive Adjustment|87.61 (0.30)|-|
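
The per-epoch update described above can be sketched as follows. This is a minimal illustration under assumptions: the function name `update_tau1` is hypothetical, and the exact form of the weighted moving average (keeping a fraction $\alpha$ of the previous threshold) is our reading for this example rather than a statement of the implementation.

```python
from statistics import fmean


def update_tau1(tau_prev, val_seen_scores, alpha=0.99):
    """One per-epoch update of the OOD threshold tau_1.

    tau_prev: threshold from the previous epoch.
    val_seen_scores: seen-class scores s_i on in-distribution validation samples.
    alpha: moving-average weight (0.99 as stated in the text above).
    Assumption: the new value keeps `alpha` of the old threshold.
    """
    epoch_mean = fmean(val_seen_scores)
    return alpha * tau_prev + (1.0 - alpha) * epoch_mean


# Hypothetical scores for illustration only, not experimental data.
tau1 = 0.5
for scores in ([0.9, 0.8, 0.7], [0.95, 0.85, 0.9]):
    tau1 = update_tau1(tau1, scores)
```

With $\alpha$ close to 1, the threshold moves slowly toward the current mean seen-class score, which is what provides the stability mentioned above.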

> Weaknesses.1) While the final experimental results are quite good, the innovation of using calibration techniques to calibrate the OOD detector and model is limited, and the inspiration it provides to the field is also limited.

Response to W.1) To the best of our knowledge, we are the first to emphasize the importance of studying calibration methods for both multiclass classifiers and OOD detectors in the context of SSL under label distribution mismatch. To this end, we proposed our adaptive label smoothing with temperature scaling, designed to be applicable to both multiclass classifiers and OOD detectors. Through extensive experimental results, including training-log plots and ablation studies on various image classification benchmark datasets, we thoroughly discussed the importance of calibration, how it specifically works within SSL under label distribution mismatch, and its efficacy for safe SSL. We believe these results can also make a solid contribution to the safe SSL community.

> Weaknesses.2) Based on the observations from Figure 3, it is evident that $\tau_1$ to some extent influences the method's effectiveness. Selecting an appropriate for different datasets appears to be crucial. I am uncertain whether a fixed $\tau_1$ can ensure optimal performance across diverse datasets. It might be beneficial to conduct experiments on additional datasets or consider an adaptive adjustment mechanism for $\tau_1$.

Response to W.2) Following your suggestion, we have explored estimating $\tau_1$ in a data-driven manner using the validation set and confirmed its efficacy on CaliMatch, as shown in Figure 3, using CIFAR-10 with $\kappa$ set to 60\%. Specifically, we compute $\tau_1$ as the average of CaliMatch's seen-class scores $s_i$ on the in-distribution validation samples and iteratively update $\tau_1$ every epoch during training using a weighted moving average with $\alpha = 0.99$ for stability. The results in the table above indicate that this adjustment mechanism can guide the selection of an appropriate $\tau_1$ during the training of CaliMatch: with the adaptive adjustment mechanism, CaliMatch achieves higher multiclass classification accuracy than OpenMatch at every fixed $\tau_1$ we tested. We appreciate your insight, and if time permits, we will follow your suggestion to apply these guidelines to additional datasets.

> Weaknesses.3) Although the feasibility of the method has been analyzed from an empirical perspective, incorporating a theoretical analysis could make this work more solid with more insights and inspiration for researchers in the field.

Response to W.3) We acknowledge that this paper does not include a thorough theoretical analysis, and we hope to conduct a more detailed analysis in future research, building upon the following preliminary empirical risk analysis. The analysis below can support the importance of two key factors in the safe SSL method, as achieved by well-calibrated CaliMatch: i) selecting a reliable subset of unlabeled seen-class data from $B_u$ to minimize the harmful SSL effects from unlabeled unseen-class examples, and ii) applying consistency regularization to the selected data for a multiclass classifier, ensuring that their pseudo-labels are accurate.

When we define an ideal unlabeled mini-batch as $B_u' = \{(x_{i}^{u}, y_{i}^{u}) \in B_u \times \mathcal{Y}:i=1,\cdots,n_{u'}\}$, which has true labels of unlabeled samples and does not include unseen-class instances, supervised loss for $B_u'$ and SSL loss for $B_u$ based on FixMatch can be formulated as follows:
\begin{align}
    &\mathcal{L'}_{\text{Fix}}(B_u';\mathcal{T}_w,\mathcal{T}_s)=\sum_{i=1}^{|B_{u}'|}\mathbb{I}\big(c(x_i^u)>\tau_2 \big)\sum_{k=1}^K -\mathbb{I}(y_i^u = k) \log p_{k}( \mathcal{T}_{s}(x_{i}^{u})), \\
    & \mathcal{L'}_{\text{Fix}}(B_u;\mathcal{T}_w,\mathcal{T}_s)  =\sum_{i=1}^{|B_u|}\mathbb{I}\big(c(x_i^u)>\tau_2 \big)\sum_{k=1}^K -\mathbb{I}\big(\text{argmax}_{l} p_{l}( \mathcal{T}_{w}(x_{i}^{u})) = k \big) \log p_{k}( \mathcal{T}_{s}(x_{i}^{u})),
\end{align}
where $c(x_i^u)$ represents confidence $\text{max}_{k \in \mathcal{Y}} p_{k}( \mathcal{T}_{w}(x_{i}^{u}))$ of $x_i^u$ in multiclass classification. Furthermore, when we define an unlabeled unseen-class mini-batch as $B_u''$ from $B_u$, a safe SSL error based on FixMatch can be bounded by two terms related to the two key factors:
\begin{align}
    & \underbrace{\Big| \mathcal{L'}_{\text{Fix}}(B_u';\mathcal{T}_w,\mathcal{T}_s) - \mathcal{L'}_{\text{Fix}}(B_u;\mathcal{T}_w,\mathcal{T}_s) \Big|}_{\text{Safe SSL error}} \\
    & \leq \underbrace{\Big| \mathcal{L'}_{\text{Fix}}(B_u';\mathcal{T}_w,\mathcal{T}_s) - \mathcal{L'}_{\text{Fix}}(B_u \setminus B_u'';\mathcal{T}_w,\mathcal{T}_s) \Big|}_{\text{Related to the second key factor, i.e., ii)}} + \underbrace{\Big| \mathcal{L'}_{\text{Fix}}(B_u'';\mathcal{T}_w,\mathcal{T}_s)\Big|}_{\text{Related to the first key factor, i.e., i)}}.
\end{align}
The safe SSL error represents the discrepancy between the FixMatch losses for the oracle mini-batch $B_u'$ and the realistic mini-batch $B_u$ under label distribution mismatch. Ensuring high-quality pseudo-labels based on the confidence $c(x_i^u)$ among the unlabeled seen-class mini-batch $B_u \setminus B_u''$ reduces the first bound term of the safe SSL error. However, the prevalent issue of overconfidence in multiclass classifiers hinders this reduction. This underscores the importance of calibration in multiclass classification, which we achieve through our well-calibrated confidence score $c_i^u$. Similarly, excluding the unlabeled unseen-class mini-batch $B_u''$ when calculating $\mathcal{L'}_{\text{Fix}}$ reduces the second bound term of the safe SSL error. Overconfidence in OOD rejection likewise prevents the second bound term from decreasing, and CaliMatch addresses this issue by using our well-calibrated seen-class score $s_i^u$.

> Question.1) Given that OOD methods and calibration methods have been previously explored, in what ways does the proposed framework distinguish itself or advance beyond existing approaches in terms of innovation or application?

Response to Q.1) We agree that a framework that jointly uses an OOD detection mechanism and a calibration technique may not seem novel. However, within the context of safe SSL under label distribution mismatch, the proposed framework brings to light the important issue of calibration. Please refer to our response to **Weakness-1** that you mentioned.

> Question.2) Does CaliMatch have the capability to detect outliers that closely resemble inliers, addressing a common challenge in outlier detection?

Response to Q.2) As you can see in Table 7, CaliMatch's OOD detection performance is not significantly better than that of OpenMatch. Therefore, it is difficult to say that CaliMatch has a better OOD detection capability compared to OpenMatch when it comes to samples that closely resemble inliers.
