# KLUE-STS with Gemini 2.5 Flash-Lite
- Created: 2025-06-26 (Thu)
- Updated: 2025-06-27 (Fri)

## 1. Environment Set-up
- scikit-learn provides the evaluation metrics, e.g. the F1 score.

In [57]:
%pip install --quiet datasets scikit-learn tqdm

Note: you may need to restart the kernel to use updated packages.


## 2. Vertex AI Gemini Set-up

In [58]:
%pip install --upgrade --quiet google-genai

Note: you may need to restart the kernel to use updated packages.


### Restart kernel after installs so that your environment can access the new packages

In [59]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

- Skip running the following cell if you use Vertex AI Workbench.
- Run it only for Colab Enterprise.

In [None]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

In [1]:
from IPython.display import HTML, Image, Markdown, display
from google import genai
from google.genai.types import GenerateContentConfig
import os

PROJECT_ID = "[your-project-id]"  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}
if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    PROJECT_ID = str(os.environ.get("GOOGLE_CLOUD_PROJECT"))

LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "global")

print(f"PROJECT_ID={PROJECT_ID}")
print(f"LOCATION={LOCATION}")

PROJECT_ID=vertex-workbench-notebook
LOCATION=us-central1


In [2]:
client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)

## 3. Load the dataset
The KLUE-STS validation split is known to contain 519 samples.
The code below verifies this number and saves it to the variable `total_num_of_samples`.

Without this number, there is no way to tell how far a run over the entire dataset has progressed, and the user has to wait blindly.
It is better practice to show the progress like this:

```bash
Processing the full validation dataset...
Evaluating gemini-2.5-flash with the entire dataset of (519 samples):   8%|▊         | 41/519 [05:09<54:40,  6.86s/it]  
```
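
What the known total buys can be illustrated without any API call. The sketch below uses only the standard library; `fake_stream` and `progress_line` are hypothetical names that do not appear in this notebook, and `tqdm` does the same bookkeeping (plus the bar) when it is given `total=`.

```python
import time


def fake_stream(n):
    """Stand-in for a streaming dataset: a generator has no __len__."""
    for i in range(n):
        yield {"guid": i}


def progress_line(done, total, elapsed_sec):
    """Format percentage, rate and remaining time from a known total."""
    rate = elapsed_sec / done if done else 0.0   # seconds per item so far
    remaining = rate * (total - done)            # estimated seconds left
    return f"{done}/{total} ({done / total:.0%}), {rate:.2f}s/it, ~{remaining:.0f}s left"


stream = fake_stream(5)

# A generator cannot report its own size, so the total must come from elsewhere.
try:
    len(stream)
    has_len = True
except TypeError:
    has_len = False

total = 5  # assumed to be known in advance, like total_num_of_samples
start = time.monotonic()
lines = []
for done, _sample in enumerate(stream, 1):
    lines.append(progress_line(done, total, time.monotonic() - start))

print(has_len)    # False
print(lines[-1])
```

Without `total`, only the item count and the rate can be shown; the percentage and the remaining-time estimate both depend on it.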

In [3]:
from datasets import load_dataset

# Set benchmark dataset and task variables
benchmark_dataset = "klue"
benchmark_task    = "sts"

# Load the dataset *without* streaming to get its length
# This will be necessary to loop through the entire data
non_streaming_dataset = load_dataset(benchmark_dataset, benchmark_task, split='validation', streaming=False)
total_num_of_samples = len(non_streaming_dataset)

print(f"The total number of samples in {benchmark_dataset.upper()}-{benchmark_task.upper()} validation is: {total_num_of_samples}")

The total number of samples in KLUE-STS validation is: 519


In [4]:
# Load the dataset *with* streaming
print(f"Loading {benchmark_dataset.upper()}-{benchmark_task.upper()} validation dataset...")
klue_sts_validation = load_dataset(benchmark_dataset, benchmark_task, split='validation', streaming=True)
print(klue_sts_validation)

Loading KLUE-STS validation dataset...
IterableDataset({
    features: ['guid', 'source', 'sentence1', 'sentence2', 'labels'],
    num_shards: 1
})


## 4. Model Set-up
Check the latest `MODEL_ID` in the official documentation: "Gen AI on Vertex AI > Doc. > [Gemini 2.5 Flash-Lite](https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash-lite)".

In [5]:
MODEL_ID = "gemini-2.5-flash-lite-preview-06-17"  # @param {type: "string"}

In [6]:
system_instruction = """
[역할 정의]
당신은 두 개의 한국어 문장이 주어졌을 때, 두 문장의 '의미'가 얼마나 유사한지를 판단하는 AI 언어 평가 전문가입니다. 
문장의 구조나 사용된 단어가 다르더라도, 문맥과 핵심 의미를 파악하여 유사성을 평가해야 합니다.

[작업 절차]
입력으로 주어진 문장 1과 문장 2의 핵심 의미를 각각 분석합니다. 
아래 **[평가 기준]**에 따라 두 문장의 의미적 관계를 판단합니다.
**[출력 형식]**에 맞춰 결과를 한 줄로 생성합니다.

[평가 기준]
1. Binary Label (0 또는 1)
1 (유사): 두 문장의 핵심 의도나 정보가 사실상 동일하다고 볼 수 있는 경우. 한 문장이 다른 문장의 요약, 부연 설명이거나, 같은 사실을 다른 표현으로 말하는 경우를 포함합니다.
0 (다름): 두 문장이 전달하는 핵심 정보나 의도가 명백히 다른 경우. 같은 주제를 다루더라도 초점이 다르거나, 서로 관련이 없는 내용을 말하는 경우는 '다름'으로 판단합니다.

2. Real-valued Label (0.0 ~ 5.0)
5.0: 완전 동일: 문장 부호, 띄어쓰기, 조사 등 사소한 차이만 있을 뿐, 의미가 100% 동일합니다.
예: "나는 밥을 먹는다" vs "나는 밥을 먹어"

4.0 ~ 4.9: 거의 동일: 사용된 어휘나 문장 구조는 다르지만, 전달하는 핵심 정보와 뉘앙스가 완전히 동일합니다.
예: "오늘 날씨가 정말 좋다" vs "오늘 날씨가 무척 화창하네"

3.0 ~ 3.9: 대체로 유사: 핵심 정보는 같지만, 부가 정보가 추가되거나 생략되어 약간의 의미 차이가 발생합니다.
예: "나는 아침으로 밥을 먹었다" vs "나는 밥을 먹었다"

2.0 ~ 2.9: 주제는 같으나 초점은 다름: 같은 주제나 상황에 대해 이야기하지만, 각 문장이 강조하는 지점이나 전달하는 정보가 다릅니다.
예: "배가 고파서 식당에 갔다" vs "그 식당의 김치찌개는 정말 맛있다"

1.0 ~ 1.9: 간접적 연관성만 있음: 공통된 단어가 있거나 소재가 겹치지만, 두 문장이 말하고자 하는 바는 완전히 다릅니다.
예: "나는 어제 축구를 봤다" vs "손흥민은 대단한 축구 선수다"

0.0 ~ 0.9: 전혀 관련 없음: 두 문장 사이에 어떠한 의미적 연관성도 찾을 수 없습니다.
예: "내일 회의는 3시에 시작합니다" vs "고양이는 귀여운 동물이다"

[출력 형식]
binary-label 값과 real-label 값을 쉼표(,)로 구분하여 한 줄에 출력합니다.
형식: binary-label: [값], real-label: [값]

[예시]
입력:
문장1: "코로나19의 전 세계적 유행으로 인해 해외여행이 어려워졌다."
문장2: "펜데믹 상황 때문에 사람들이 국외로 나가는 것이 힘들어졌다."
출력:
binary-label: 1, real-label: 4.5

입력:
문장1: "이 영화 정말 재미있더라."
문장2: "그 영화 주인공 연기가 인상 깊었어."
출력:
binary-label: 0, real-label: 2.8

입력:
문장1: "노트북 배터리가 거의 다 닳았네."
문장2: "오늘 저녁 메뉴는 뭘로 할까?"
출력:
binary-label: 0, real-label: 0.0
"""

In [7]:
def create_prompt(sentence1, sentence2):
    prompt = f"""
문장1: {sentence1}
문장2: {sentence2} 
"""
    return prompt.strip()

## 5. Test with a sample prompt

In [8]:
sample = next(iter(klue_sts_validation))
sample_prompt = create_prompt( sample['sentence1'], sample['sentence2'] )
print(sample_prompt)

문장1: 무엇보다도 호스트분들이 너무 친절하셨습니다.
문장2: 무엇보다도, 호스트들은 매우 친절했습니다.


In [9]:
prompt = sample_prompt

In [10]:
response = client.models.generate_content(
    model=MODEL_ID,
    contents=prompt,
    config=GenerateContentConfig(
        temperature=0.0,  # 0 for consistency
        system_instruction=system_instruction,
        #top_p=0.95,
        #candidate_count=1,
        #thinking_config=thinking_config,
    ),
)
display(Markdown(response.text))

ClientError: 404 NOT_FOUND. {'error': {'code': 404, 'message': 'Publisher Model `projects/vertex-workbench-notebook/locations/us-central1/publishers/google/models/gemini-2.5-flash-lite-preview-06-17` was not found or your project does not have access to it. Please ensure you are using a valid model version. For more information, see: https://cloud.google.com/vertex-ai/generative-ai/docs/learn/model-versions', 'status': 'NOT_FOUND'}}

## 6. Test the loop with only ten samples

In [11]:
import time
import re
from tqdm import tqdm  # Show the progress
import itertools # Import itertools to safely slice the dataset

# Initialize variables
binary_predictions = []
real_predictions   = []
true_binary_labels = []
true_real_labels   = []
# To store sentences for the results table
true_sentences1    = [] 
true_sentences2    = []

error_count = 0
num_test_samples = 10
sleep_interval_between_api_calls = 0.03 # sec
#description = f"Evaluating with {MODEL_ID}"
description = f"Evaluating {MODEL_ID} on {num_test_samples} samples"

In [36]:
# Main evaluation loop
for i, sample in enumerate(tqdm(itertools.islice(klue_sts_validation, num_test_samples), desc=description, total=num_test_samples), 1):
    sentence1 = sample['sentence1']
    sentence2 = sample['sentence2']
    sample_prompt = create_prompt(sentence1, sentence2)

    ground_truth_binary = sample['labels']['binary-label']
    ground_truth_real = sample['labels']['real-label']
    
    try:
        response = client.models.generate_content(
            model=MODEL_ID,
            contents=sample_prompt,
            config=GenerateContentConfig(
                temperature=0.0,  # 0 for consistency
                system_instruction=system_instruction,
            ),
        )
        model_output = response.text.strip()

        # Parse model_output with regular expression
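        # The number is digits with an optional fractional part, so a trailing
        # period (e.g. "4.5.") is not captured and float() cannot fail on it.
        match = re.search(r"binary-label:\s*([01])\s*,\s*real-label:\s*([0-9]+(?:\.[0-9]+)?)", model_output)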

        if match:
            # Extract values and convert types
            b_pred = int(match.group(1))
            r_pred = float(match.group(2))
            
            # Append results to lists
            binary_predictions.append(b_pred)
            real_predictions.append(r_pred)
            true_binary_labels.append(ground_truth_binary)
            true_real_labels.append(ground_truth_real)
            # Store sentences
            true_sentences1.append(sentence1)
            true_sentences2.append(sentence2)

        else:
            error_count += 1
            print(f"\n----- Sample {i}/{num_test_samples} (Format Error) -----")
            print(f"Mismatched model output: {model_output}")
            
    except Exception as e:
        error_count += 1
        print(f"\n Sample {i}/{num_test_samples} (API Error)")
        print(f"An error occurred: {e}")

    # To prevent overloading the API
    time.sleep( sleep_interval_between_api_calls )

print(f"\nEvaluation Finished.")
print(f"Total samples processed: {len(binary_predictions) + error_count}")
print(f"Successful predictions: {len(binary_predictions)}")
print(f"Format errors or API issues: {error_count}")

Evaluating gemini-2.5-flash on 10 samples: 100%|██████████| 10/10 [01:05<00:00,  6.56s/it]


Evaluation Finished.
Total samples processed: 10
Successful predictions: 10
Format errors or API issues: 0
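
The parsing step inside the loop can also be factored into a small function that is testable without calling the model. This is a minimal sketch, not the notebook's own code: `parse_prediction` and `LABEL_PATTERN` are hypothetical names, and the range check against the 0.0 to 5.0 scale is an added assumption based on the scale defined in the system instruction.

```python
import re

# The number must be digits with an optional fractional part,
# so a trailing period ("2.8.") is left out of the captured value.
LABEL_PATTERN = re.compile(
    r"binary-label:\s*([01])\s*,\s*real-label:\s*([0-9]+(?:\.[0-9]+)?)"
)


def parse_prediction(model_output):
    """Return (binary, real), or None when the output does not match."""
    match = LABEL_PATTERN.search(model_output)
    if match is None:
        return None
    binary = int(match.group(1))
    real = float(match.group(2))
    if not 0.0 <= real <= 5.0:  # outside the scale defined in the prompt
        return None
    return binary, real


print(parse_prediction("binary-label: 1, real-label: 4.5"))   # (1, 4.5)
print(parse_prediction("binary-label: 0, real-label: 2.8."))  # (0, 2.8)
print(parse_prediction("I think they are similar."))          # None
```

Returning `None` for both a format mismatch and an out-of-range score lets the loop count either case as a format error rather than as an API error.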





## 7. Evaluation
- For binary-labels, print the accuracy and scikit-learn's `classification_report` (precision, recall, F1).
- For real-labels, calculate regression metrics: mean squared error (MSE), mean absolute error (MAE), and the Pearson correlation.

In [37]:
import numpy as np
import pandas as pd  # To create the results table
from sklearn.metrics import (
    classification_report,
    accuracy_score,
    mean_squared_error,
    mean_absolute_error,
)
from scipy.stats import pearsonr

# --- These lists are populated by the evaluation loop in the previous cell ---
# To try this cell without calling the API, uncomment the dummy data below.
# true_sentences1 = ['Sentence 1A', 'Sentence 1B'] 
# true_sentences2 = ['Sentence 2A', 'Sentence 2B']
# true_binary_labels = [1, 0]
# true_real_labels = [4.5, 1.2]
# binary_predictions = [1, 1]
# real_predictions = [4.2, 2.0]

print("\n===== Sample-by-Sample Comparison Table =====")

if true_binary_labels and binary_predictions:
    # Create a dictionary with the results
    results_data = {
        'Sentence 1': true_sentences1,
        'Sentence 2': true_sentences2,
        'True Binary': true_binary_labels,
        'Pred Binary': binary_predictions,
        'True Real': [f"{x:.2f}" for x in true_real_labels],
        'Pred Real': [f"{x:.2f}" for x in real_predictions]
    }
    

    # Create and display the pandas DataFrame
    results_df = pd.DataFrame(results_data)
    
    # Set display options to show full text in columns
    #pd.set_option('display.max_colwidth', None)
    #pd.set_option('display.width', 1000)
    #print(results_df.to_string())
    # -> The output table looks messy!
    
    # Iterate over the DataFrame and print each sample in a structured, readable format
    for index, row in results_df.iterrows():
        print(f"\nSample {index + 1}/{len(results_df)}")
        print(f"Sentence 1: {row['Sentence 1']}")
        print(f"Sentence 2: {row['Sentence 2']}")
        print("---------------------------------------------------------")
        print(f"  - Ground Truth : Binary={row['True Binary']}, Real={row['True Real']}")
        print(f"  - Prediction   : Binary={row['Pred Binary']}, Real={row['Pred Real']}")
    print("==========================================================")

else:
    print("\nNo valid predictions to display in the results table.")

print("\n===== Display Evaluation Metrics =====")
print(f"\n\n--- {MODEL_ID} KLUE-STS (Zero-shot) Benchmark Results ---")
print(f"Evaluated on {len(true_binary_labels)} samples.")

print("\n===== Binary Label (Classification) =====")
if true_binary_labels and binary_predictions:
    accuracy = accuracy_score(true_binary_labels, binary_predictions)
    print(f"\nOverall Accuracy: {accuracy:.4f}")

    report = classification_report(
        true_binary_labels,
        binary_predictions,
        target_names=['Different (0)', 'Similar (1)'],
        zero_division=0
    )
    print("\nClassification Report:")
    print(report)
else:
    print("\nCould not calculate classification metrics. No valid binary predictions found.")

    
print("\n===== Real Label (Regression) =====")
if true_real_labels and real_predictions:
    mse = mean_squared_error(true_real_labels, real_predictions)
    print(f"\nMean Squared Error (MSE): {mse:.4f}")

    mae = mean_absolute_error(true_real_labels, real_predictions)
    print(f"Mean Absolute Error (MAE): {mae:.4f}")

    pearson_corr, _ = pearsonr(true_real_labels, real_predictions)
    print(f"Pearson Correlation: {pearson_corr:.4f}")
else:
    print("\nCould not calculate regression metrics. No valid real-valued predictions found.")


===== Sample-by-Sample Comparison Table =====

Sentence 1: 무엇보다도 호스트분들이 너무 친절하셨습니다.
Sentence 2: 무엇보다도, 호스트들은 매우 친절했습니다.
---------------------------------------------------------
  - Ground Truth : Binary=1, Real=4.86
  - Prediction   : Binary=1, Real=5.00

Sentence 1: 주요 관광지 모두 걸어서 이동가능합니다.
Sentence 2: 위치는 피렌체 중심가까지 걸어서 이동 가능합니다.
---------------------------------------------------------
  - Ground Truth : Binary=0, Real=1.43
  - Prediction   : Binary=0, Real=2.50

Sentence 1: 학생들의 균형 있는 영어능력을 향상시킬 수 있는 학교 수업을 유도하기 위해 2018학년도 수능부터 도입된 영어 영역 절대평가는 올해도 유지한다.
Sentence 2: 영어 영역의 경우 학생들이 한글 해석본을 암기하는 문제를 해소하기 위해 2016학년도부터 적용했던 EBS 연계 방식을 올해도 유지한다.
---------------------------------------------------------
  - Ground Truth : Binary=0, Real=1.29
  - Prediction   : Binary=0, Real=2.50

Sentence 1: 다만, 도로와 인접해서 거리의 소음이 들려요.
Sentence 2: 하지만, 길과 가깝기 때문에 거리의 소음을 들을 수 있습니다.
---------------------------------------------------------
  - Ground Truth : Binary=1, Real=3.71
  - Prediction   : Binary=1, R

## Interpreting the results
### Metrics for Binary Classification
- Accuracy: The proportion of total samples for which the model correctly predicted 'similar (1)' or 'different (0)'.
- F1-Score: The harmonic mean of Precision and Recall. 
  - It is a reliable classification performance metric, even when the data is imbalanced. 
  - The F1-score for the "similar (1)" class is typically used as the key metric.
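
As a worked example of the harmonic mean, the sketch below computes precision, recall and F1 for one class from raw counts. The function name `precision_recall_f1` and the counts are hypothetical and chosen only for illustration; they are not results from this notebook.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts for the "similar (1)" class.
p, r, f1 = precision_recall_f1(tp=90, fp=30, fn=10)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=0.75 recall=0.90 f1=0.82
```

The harmonic mean sits closer to the lower of the two values, so a model cannot reach a high F1 by maximizing recall alone.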

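### Metrics for Real-valued Regression
- MSE (Mean Squared Error) and RMSE (Root Mean Squared Error)
  - MSE is the average of the squared differences between the model's predicted scores and the ground-truth scores. RMSE is its square root, which puts the error back on the 0.0 to 5.0 score scale. The code in this notebook reports MSE.
  - A value closer to 0 signifies that the model has accurately predicted the fine-grained scores between 0.0 and 5.0.
  - Because the errors are squared, both metrics are sensitive to outliers.

- MAE (Mean Absolute Error)
  - The average of the absolute errors.
  - It is less sensitive to outliers than RMSE and is useful for intuitively interpreting the actual magnitude of the error.
  - For example, an MAE of 0.5 means that the predictions are off by 0.5 points on average.

- Pearson Correlation
  - Measures how linearly the predicted scores track the ground-truth scores, on a scale from -1 to 1. A value closer to 1 is better.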
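
The difference in outlier sensitivity can be seen on a toy example. The sketch below uses made-up scores (not notebook results) and a hypothetical helper `regression_errors`: both prediction sets have the same MAE, but the one with a single large miss has twice the RMSE.

```python
import math


def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equal-length sequences."""
    diffs = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    mse = sum(d * d for d in diffs) / len(diffs)
    return mae, mse, math.sqrt(mse)


# Made-up scores on the 0.0-5.0 scale, for illustration only.
y_true = [4.0, 1.0, 3.0, 2.0]
close = [3.5, 1.5, 3.5, 1.5]  # every prediction is off by 0.5
spiky = [4.0, 1.0, 3.0, 4.0]  # three exact, one off by 2.0

for name, y_pred in [("close", close), ("spiky", spiky)]:
    mae, mse, rmse = regression_errors(y_true, y_pred)
    print(f"{name}: MAE={mae:.2f} MSE={mse:.2f} RMSE={rmse:.2f}")
# close: MAE=0.50 MSE=0.25 RMSE=0.50
# spiky: MAE=0.50 MSE=1.00 RMSE=1.00
```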

## 8. Loop through the entire validation dataset

In [45]:
import time
import re
from tqdm import tqdm
import itertools

# Configure output files
num_random_samples_to_save = 100 # for quick review
random_samples_filename = f"{benchmark_dataset}-{benchmark_task}-{MODEL_ID}-random_samples_for_review.txt"
full_results_filename = f"{benchmark_dataset}-{benchmark_task}-{MODEL_ID}-full_evaluation_results.csv"

# Initialize variables
binary_predictions = []
real_predictions   = []
true_binary_labels = []
true_real_labels   = []
# To store sentences for the results table
true_sentences1    = [] 
true_sentences2    = []

error_count = 0
sleep_interval_between_api_calls = 0.03 # sec

# Build the progress-bar description
# total_num_of_samples was computed in Section 3, BEFORE loading the dataset in streaming mode.
# Note: len(klue_sts_validation) would fail with "TypeError: object of type 'IterableDataset' has no len()"
description = f"Evaluating {MODEL_ID} with the entire dataset of ({total_num_of_samples} samples)"

print(description)

Evaluating gemini-2.5-flash with the entire dataset of (519 samples)


In [46]:
# Main evaluation loop
print("Processing the full validation dataset...")
for i, sample in enumerate(tqdm(klue_sts_validation, desc=description, total=total_num_of_samples), 1):
    sentence1 = sample['sentence1']
    sentence2 = sample['sentence2']
    sample_prompt = create_prompt(sentence1, sentence2)

    ground_truth_binary = sample['labels']['binary-label']
    ground_truth_real = sample['labels']['real-label']
    
    try:
        response = client.models.generate_content(
            model=MODEL_ID,
            contents=sample_prompt,
            config=GenerateContentConfig(
                temperature=0.0, # 0 for consistency
                system_instruction=system_instruction,
            ),
        )
        model_output = response.text.strip()

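        # The number is digits with an optional fractional part, so a trailing
        # period (e.g. "4.5.") is not captured and float() cannot fail on it.
        match = re.search(r"binary-label:\s*([01])\s*,\s*real-label:\s*([0-9]+(?:\.[0-9]+)?)", model_output)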

        if match:
            b_pred = int(match.group(1))
            r_pred = float(match.group(2))
            
            binary_predictions.append(b_pred)
            real_predictions.append(r_pred)
            true_binary_labels.append(ground_truth_binary)
            true_real_labels.append(ground_truth_real)
            true_sentences1.append(sentence1) 
            true_sentences2.append(sentence2)    
        else:
            error_count += 1
            print(f"\n Sample {i}/{total_num_of_samples} (Format Error)")
            print(f"Mismatched model output: {model_output}")

    except Exception as e:
        error_count += 1
        print(f"\n Sample {i}/{total_num_of_samples} (API Error)")
        print(f"An error occurred: {e}")

    time.sleep(sleep_interval_between_api_calls)

print(f"\nEvaluation Finished.")
print(f"Total samples processed: {len(binary_predictions) + error_count}")
print(f"Successful predictions: {len(binary_predictions)}")
print(f"Format errors or API issues: {error_count}")

Processing the full validation dataset...


Evaluating gemini-2.5-flash with the entire dataset of (519 samples): 100%|██████████| 519/519 [1:03:10<00:00,  7.30s/it]


Evaluation Finished.
Total samples processed: 519
Successful predictions: 519
Format errors or API issues: 0
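
The loop above counts a failed API call as an error and moves on, so that sample is missing from the metrics. One possible alternative is to retry with exponential backoff before giving up. The sketch below is an assumption-laden illustration, not part of the notebook: `call_with_retry` and `flaky` are hypothetical names, and `flaky` merely stands in for the `client.models.generate_content` call.

```python
import random
import time


def call_with_retry(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller count it as an error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)


# Hypothetical flaky function: fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "binary-label: 1, real-label: 4.5"


waits = []
result = call_with_retry(flaky, sleep=waits.append)  # record waits instead of sleeping
print(result, calls["n"], len(waits))
# binary-label: 1, real-label: 4.5 3 2
```

The sketch retries on any exception for brevity. A real loop would retry only on transient failures such as rate limits, not on a permanent error like the 404 shown in Section 5.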





In [47]:
import numpy as np
import pandas as pd
from sklearn.metrics import (
    classification_report,
    accuracy_score,
    mean_squared_error,
    mean_absolute_error,
)
from scipy.stats import pearsonr

# Create the full results DataFrame
if true_binary_labels and binary_predictions:
    results_data = {
        'Sentence 1': true_sentences1,
        'Sentence 2': true_sentences2,
        'True Binary': true_binary_labels,
        'Pred Binary': binary_predictions,
        'True Real': true_real_labels,
        'Pred Real': real_predictions
    }
    results_df = pd.DataFrame(results_data)

    # 1. Save ALL results to a CSV file for later analysis
    try:
        results_df.to_csv(full_results_filename, index=False, encoding='utf-8-sig')
        print(f"\nSuccessfully saved all {len(results_df)} results to '{full_results_filename}'")
    except Exception as e:
        print(f"\nError saving full results to CSV: {e}")

    # 2. Save a random selection of samples to a text file for quick review
    if not results_df.empty and num_random_samples_to_save > 0:
        try:
            num_to_sample = min(num_random_samples_to_save, len(results_df))
            random_samples_df = results_df.sample(n=num_to_sample)
            
            with open(random_samples_filename, 'w', encoding='utf-8') as f:
                f.write(f"--- Randomly Selected Samples for Review ({num_to_sample} of {len(results_df)}) ---\n")
                for index, row in random_samples_df.iterrows():
                    f.write(f"\n====================== Sample (Original Index: {index}) ======================\n")
                    f.write(f"Sentence 1: {row['Sentence 1']}\n")
                    f.write(f"Sentence 2: {row['Sentence 2']}\n")
                    f.write("---------------------------------------------------------\n")
                    f.write(f"  - Ground Truth : Binary={row['True Binary']}, Real={row['True Real']:.2f}\n")
                    f.write(f"  - Prediction   : Binary={row['Pred Binary']}, Real={row['Pred Real']:.2f}\n")
                f.write("\n==========================================================\n")
            print(f"Successfully saved {num_to_sample} random samples to '{random_samples_filename}'")
        except Exception as e:
            print(f"\nError saving random samples to text file: {e}")

else:
    print("\nNo valid predictions were generated to save or analyze.")


# 3. Display Final Evaluation Metrics
print(f"\n\n {MODEL_ID} KLUE-STS (Zero-shot) Benchmark Results")

print("\n===== Binary Label (Classification) =====")
if true_binary_labels and binary_predictions:
    accuracy = accuracy_score(true_binary_labels, binary_predictions)
    print(f"\nOverall Accuracy: {accuracy:.4f}")

    report = classification_report(
        true_binary_labels,
        binary_predictions,
        target_names=['Different (0)', 'Similar (1)'],
        zero_division=0
    )
    print("\nClassification Report:")
    print(report)
else:
    print("\nCould not calculate classification metrics. No valid binary predictions found.")

print("\n===== Real Label (Regression) =====")
if true_real_labels and real_predictions:
    mse = mean_squared_error(true_real_labels, real_predictions)
    print(f"\nMean Squared Error (MSE): {mse:.4f}")

    mae = mean_absolute_error(true_real_labels, real_predictions)
    print(f"Mean Absolute Error (MAE): {mae:.4f}")

    pearson_corr, _ = pearsonr(true_real_labels, real_predictions)
    print(f"Pearson Correlation: {pearson_corr:.4f}")
else:
    print("\nCould not calculate regression metrics. No valid real-valued predictions found.")


Successfully saved all 519 results to 'full_evaluation_results.csv'
Successfully saved 100 random samples to 'random_samples_for_review.txt'


 gemini-2.5-flash KLUE-STS (Zero-shot) Benchmark Results

===== Binary Label (Classification) =====

Overall Accuracy: 0.8439

Classification Report:
               precision    recall  f1-score   support

Different (0)       0.95      0.77      0.85       299
  Similar (1)       0.75      0.95      0.84       220

     accuracy                           0.84       519
    macro avg       0.85      0.86      0.84       519
 weighted avg       0.87      0.84      0.84       519


===== Real Label (Regression) =====

Mean Squared Error (MSE): 1.5102
Mean Absolute Error (MAE): 0.9819
Pearson Correlation: 0.8116
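
The saved CSV can be inspected later without rerunning the model. The sketch below assumes the column names written by the cell above (`True Real`, `Pred Real`, and so on) and uses only the standard library; `worst_predictions` is a hypothetical helper, and the demo rows are made up so that the example runs without the real results file.

```python
import csv
import os
import tempfile


def worst_predictions(csv_path, k=3):
    """Return the k rows with the largest |Pred Real - True Real|."""
    # utf-8-sig matches the encoding passed to to_csv and strips the BOM.
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        row["abs_error"] = abs(float(row["Pred Real"]) - float(row["True Real"]))
    return sorted(rows, key=lambda r: r["abs_error"], reverse=True)[:k]


# Tiny made-up file with the notebook's column names, for illustration only.
demo_rows = [
    {"Sentence 1": "a", "Sentence 2": "b", "True Binary": 1, "Pred Binary": 1, "True Real": 4.5, "Pred Real": 4.0},
    {"Sentence 1": "c", "Sentence 2": "d", "True Binary": 0, "Pred Binary": 1, "True Real": 1.0, "Pred Real": 3.5},
    {"Sentence 1": "e", "Sentence 2": "f", "True Binary": 0, "Pred Binary": 0, "True Real": 0.5, "Pred Real": 1.5},
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline="", encoding="utf-8-sig"
) as tmp:
    writer = csv.DictWriter(tmp, fieldnames=list(demo_rows[0]))
    writer.writeheader()
    writer.writerows(demo_rows)

worst = worst_predictions(tmp.name, k=2)
os.remove(tmp.name)
print([(r["Sentence 1"], r["abs_error"]) for r in worst])
# [('c', 2.5), ('e', 1.0)]
```

With the real file, passing `full_results_filename` as `csv_path` would list the sentence pairs that contribute most to the MSE.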
