# Introduction to Active Learning (AL)

This notebook introduces an Active Learning (AL) workflow that predicts properties through iterative sampling and model training.

First, we need to install and import the required libraries. One of these libraries is `modAL` (https://modal-python.readthedocs.io/en/latest/content/models/ActiveLearner.html), an active learning framework for Python 3. Built on top of `scikit-learn`, it enables you to rapidly create active learning workflows.

In this notebook, we will:
- Explore the principles of active learning
- Use the modAL library to implement an AL model

**NB!** 📂 To run this notebook successfully, download the required data files from the GitHub repository (https://github.com/lmoranglez/SummerSchool.git) and save them in the same folder as this notebook.

In [None]:
# run this cell once to install the required libraries
#import sys
#!{sys.executable} -m pip install pandas==2.2.3 scikit-learn==1.6.1 matplotlib==3.10.1 seaborn==0.13.2 rdkit==2024.9.6
#!{sys.executable} -m pip install modAL-python==0.4.2.1

In [None]:
# List the libraries installed in your IPython kernel
#!pip list

In [None]:
# Libraries to import
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling
from modAL.utils.selection import multi_argmax

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression

# Part 1: EDBO+: an active learning platform

### 1.1. Visualize EDBO+ results

**EDBO+** is an open-source platform for single- and multi-objective active learning based on Bayesian optimization, with an accompanying web application (https://www.edbowebapp.com/). It is used to screen reaction conditions.

To illustrate its utility, we begin with a single-objective optimization task: **maximizing reaction yield**.

In the cells below, we will visualize the hypothetical chemical space of possible conditions. Later, we will load and analyze a pre-optimized dataset generated with the AL strategy implemented in EDBO+.

*Chemical Space Features*
- Temperature (°C): [0, 20, 40, 80, 100, 120, 140]
- Concentration (M): [0.1, 0.2, 1.0, 1.2, 1.5, 1.6, 2.0]
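
As a quick check of the size of this space, the sketch below enumerates every Temperature/Concentration pair with `itertools.product`. The names `temperatures`, `concentrations`, and `grid` are illustrative and are not used elsewhere in the notebook.

```python
from itertools import product

# Feature values listed above
temperatures = [0, 20, 40, 80, 100, 120, 140]          # °C
concentrations = [0.1, 0.2, 1.0, 1.2, 1.5, 1.6, 2.0]   # M

# Cartesian product: one candidate experiment per (temperature, concentration) pair
grid = list(product(temperatures, concentrations))

print(len(grid))   # 7 x 7 = 49 candidate conditions
print(grid[0])     # (0, 0.1)
```

Each of these 49 candidates is one cell of the heatmaps shown later in this section.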
  

1.1.1. Define and load the chemical space

In [None]:
# Define the entire chemical space without labels
Temperature = [0, 20, 40, 80, 100, 120, 140]
Concentration = [0.1, 0.2, 1.0, 1.2, 1.5, 1.6, 2.0]

combinations = []
# Table with all the potential combinations 
for j in Temperature:
    for k in Concentration:
        combinations.append({
                            'Temperature':j,
                            'Concentration':k})
        
chemicalspace = pd.DataFrame(combinations)
chemicalspace['Yield'] = np.nan  # Initialize as empty

In [None]:
# Load data generated after the optimization process
# NB! Ensure these .csv files are in the same folder as the Jupyter notebook.

# Each iteration file contains the scope after the oracle has added labels
iteration1 = pd.read_csv("example1_label.csv")
iteration2 = pd.read_csv("example2_label.csv")
iteration3 = pd.read_csv("example3_label.csv")
iteration4 = pd.read_csv("example4_label.csv")
iteration5 = pd.read_csv("example5_label.csv")

# Predictions of the model trained on the labels available at each iteration
prediction1 = pd.read_csv("pred_example1.csv")
prediction2 = pd.read_csv("pred_example2.csv")
prediction3 = pd.read_csv("pred_example3.csv")
prediction4 = pd.read_csv("pred_example4.csv")
prediction5 = pd.read_csv("pred_example5.csv")

1.1.2. Visualize the hypothetical chemical space *before* active learning begins

In [None]:
# plot functions to visualize the results generated by the EDBO+ platform
def plot_single_heatmap(data: pd.DataFrame,  
                       y_col: str = 'Yield',  # Can be 'Yield' or 'Yield_predicted_mean'
                       title: str = None, 
                       figsize: tuple = (5, 4),
                       is_prediction: bool = False): # if True, modifies labels for predicted values
    """
    Create a heatmap for either actual Yield or predicted Yield values.
    """
    # Validate input column
    if y_col not in data.columns:
        raise ValueError(f"Column '{y_col}' not found in DataFrame")
    
    # Pivot data
    heatmap_data = data.pivot(index='Temperature', columns='Concentration', values=y_col)
    
    # Create figure
    plt.figure(figsize=figsize)
    
    # Plot heatmap
    ax = sns.heatmap(
        heatmap_data,
        annot=True,
        fmt=".1f",
        cmap='YlOrRd',
        linewidths=1,
        linecolor='black',
        square=True,
        cbar_kws={'label': f"{'Predicted ' if is_prediction else ''}Yield (%)", 
                 'shrink': 0.8},
        vmin=0,
        vmax=100,
        annot_kws={'fontsize': 11}
    )

    default_title = f"{'Predicted ' if is_prediction else ''}Yield" 
    plt.title(title if title else default_title, fontsize=14, pad=20)
    
    # Axis labels
    ax.set_xlabel('Concentration (M)', fontsize=12)
    ax.set_ylabel('Temperature (°C)', fontsize=12)
    
    # Handle missing values
    for y in range(heatmap_data.shape[0]):
        for x in range(heatmap_data.shape[1]):
            if pd.isna(heatmap_data.iloc[y, x]):
                ax.text(x + 0.5, y + 0.5, 'N/A', 
                       ha='center', va='center', 
                       color='gray', fontstyle='italic')
    
    plt.tight_layout()
    return ax

def plot_combined_results(iteration_data: pd.DataFrame, 
                        prediction_data: pd.DataFrame, 
                        x_var='Concentration', # variable to plot on x-axis ('Concentration' or 'Temperature')
                        figsize=(10, 8), 
                        heatmap_title=None,  # title for heatmap, if you later add it
                        lineplot_title=None): # title for line plot, if you later add it
    """
    Create combined plot with heatmap and prediction lineplot.
    """
    # Validate x_var input
    if x_var not in ['Concentration', 'Temperature']:
        raise ValueError("x_var must be either 'Concentration' or 'Temperature'")
    
    # Set the other variable (y_var) based on x_var choice
    y_var = 'Temperature' if x_var == 'Concentration' else 'Concentration'
    
    # Create figure with two subplots
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=figsize)
    
    # --- First plot: Heatmap ---
    # Work on a copy so the caller's DataFrame is not modified;
    # non-numeric entries such as 'PENDING' become NaN
    iteration_data = iteration_data.copy()
    iteration_data['Yield'] = pd.to_numeric(iteration_data['Yield'], errors='coerce')
    
    heatmap_data = iteration_data.pivot(index='Temperature', columns='Concentration', values='Yield')
    
    # Create heatmap
    sns.heatmap(heatmap_data, 
                annot=True, 
                fmt=".1f", 
                cmap='YlOrRd',
                mask=heatmap_data.isnull(),
                linewidths=1,
                linecolor='black',
                square=True,
                cbar_kws={'label': 'Yield (%)', 'shrink': 0.8},
                vmin=0,
                vmax=100,
                annot_kws={'fontsize': 12},
                ax=ax1)
    
    # Set titles
    heatmap_title = heatmap_title or 'Yield Heatmap by Temperature and Concentration'
    ax1.set_title(heatmap_title)
    ax1.set_xlabel('Concentration (M)', fontsize=12)
    ax1.set_ylabel('Temperature (°C)', fontsize=12)
    
    # --- Second plot: Line plot with uncertainty ---
    # Work on a copy so the caller's DataFrame is not modified;
    # non-numeric entries such as 'PENDING' become NaN
    prediction_data = prediction_data.copy()
    prediction_data['Yield'] = pd.to_numeric(prediction_data['Yield'], errors='coerce')
    
    ground_truth = prediction_data.dropna(subset=['Yield'])
    predictions = prediction_data.copy()
    
    # Sort by x_var for proper line plotting
    predictions = predictions.sort_values(x_var)
    
    # Predicted mean (blue line)
    sns.lineplot(
        data=predictions,
        x=x_var,
        y='Yield_predicted_mean',
        color='blue',
        label='Predicted Yield',
        linewidth=2,
        ax=ax2
    )
    
    # Uncertainty band (mean ± sqrt(variance))
    ax2.fill_between(
        predictions[x_var],
        predictions['Yield_predicted_mean'] - np.sqrt(predictions['Yield_predicted_variance']),
        predictions['Yield_predicted_mean'] + np.sqrt(predictions['Yield_predicted_variance']),
        color='blue',
        alpha=0.2,
        label='Uncertainty (±σ)',
    )
    
    # Ground truth points (red circles)
    ax2.scatter(
        ground_truth[x_var],
        ground_truth['Yield'],
        color='red',
        s=80,
        label='Ground Truth',
        edgecolor='black',
        zorder=10,
    )
    
    # Set axis limits based on data range
    x_min = predictions[x_var].min() - (0.1 * predictions[x_var].max())
    x_max = predictions[x_var].max() * 1.1
    ax2.set_xlim(x_min, x_max)
    
    # Set titles and labels
    lineplot_title = lineplot_title or f"Yield Predictions by {x_var.capitalize()} with Uncertainty"
    ax2.set_title(lineplot_title)
    ax2.set_xlabel(x_var.capitalize())
    ax2.set_ylabel('Yield (%)')
    
    # Configure legend
    handles, labels = ax2.get_legend_handles_labels()
    ax2.legend(handles, labels, loc='upper right')
    
    plt.tight_layout()
    return fig, (ax1, ax2)


Empty chemical space

In [None]:
# The initial chemical space is empty and contains no labels.
ax = plot_single_heatmap(chemicalspace, 
                   y_col='Yield',
                   is_prediction=False)
plt.show()

**Iteration (1)** initial labeled data added: Target property distribution (left) and the model trained on these (1) labeled data points, showing `Yield_predicted_mean` vs. `x_var` ('Temperature'/'Concentration') (right).

In [None]:
fig, (ax1, ax2) = plot_combined_results(iteration1, prediction1, x_var='Temperature', figsize=(16,6))

**Iteration (2)** labeled data added: Target property distribution (left) and updated model using (2) labeled data points, with `Yield_predicted_mean` vs. `x_var` ('Temperature'/'Concentration') (right).

In [None]:
fig, (ax1, ax2) = plot_combined_results(iteration2, prediction2, x_var='Temperature', figsize=(16,6))

**Iteration (3)** labeled data added: Target property distribution (left) and refined model using (3) labeled data points, showing `Yield_predicted_mean` vs. `x_var` ('Temperature'/'Concentration') (right).

In [None]:
fig, (ax1, ax2) = plot_combined_results(iteration3, prediction3, x_var='Temperature', figsize=(16,6))

**Iteration (4)** labeled data added: Target property distribution (left) and optimized model using (4) labeled data points, comparing `Yield_predicted_mean` vs. `x_var` ('Temperature'/'Concentration') (right).

In [None]:
fig, (ax1, ax2) = plot_combined_results(iteration4, prediction4, x_var='Temperature', figsize=(16,6))

**Iteration (5)** labeled data added: Target property distribution (left) and optimized model using (5) labeled data points, comparing `Yield_predicted_mean` vs. `x_var` ('Temperature'/'Concentration') (right).

In [None]:
fig, (ax1, ax2) = plot_combined_results(iteration5, prediction5, x_var='Temperature', figsize=(16,6))

**Predicted** yield heatmap. \
**NB!**: Only labeled points are verified. Other values are model estimates whose accuracy depends on their uncertainty.
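
To make the role of the uncertainty concrete, here is a minimal sketch that turns a predicted mean and variance into a ±1σ interval, the same quantity drawn as the shaded band in `plot_combined_results`. The numbers are made up for illustration and do not come from the EDBO+ prediction files.

```python
import numpy as np

# Hypothetical values standing in for the `Yield_predicted_mean` and
# `Yield_predicted_variance` columns of a prediction file
predicted_mean = np.array([62.0, 75.0, 40.0])
predicted_variance = np.array([4.0, 100.0, 25.0])

# The standard deviation is the square root of the variance
sigma = np.sqrt(predicted_variance)

lower = predicted_mean - sigma
upper = predicted_mean + sigma

for m, lo, hi in zip(predicted_mean, lower, upper):
    print(f"mean = {m:.1f} %, ±1σ interval = [{lo:.1f}, {hi:.1f}] %")

# The widest interval marks the least trustworthy estimate
least_certain = int(np.argmax(sigma))
print("Least certain prediction index:", least_certain)
```

In this toy example the second prediction has the highest mean but also the widest interval, so it is the estimate that most needs experimental confirmation.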

In [None]:
plot_single_heatmap(prediction5, 
                   y_col='Yield_predicted_mean',
                   is_prediction=True,
                   title="Model Predictions (Iteration 5) - Predicted Yield",)

### 1.2. Key Observations from Yield Heatmap Analysis

**Temperature Dependence:**
- Yield increases from 0 to 80°C
- Maximum yield achieved at 40°C with 1.6M catalyst
- Exponential decrease above 80°C suggests:
  - Catalyst thermal decomposition
  - Potential poisoning effects
  - Competing side reactions becoming dominant

**Catalyst Concentration (0.1-2.0M):**
- No significant correlation with yield observed
- Indicates either:
  1) Catalyst saturation already at minimum concentration
  2) Non-rate-limiting role in this concentration range

**Recommended Next Steps:**
1. Expand low concentration testing (0.01-0.1M)
2. Conduct thermal stability studies

### 1.3. Next steps: customize and run your own EDBO+ optimization

Now that you've evaluated EDBO+ with a predefined example, you can explore new optimizations by modifying the parameters. Below are the steps to recreate the previous example with your own adjustments.

Key Requirements for Visualization

To reuse the plotting functions above, keep these exact column names: `Temperature`, `Concentration`, and `Yield`. You can, however, change the values of the `Temperature` and `Concentration` features (both in the EDBO+ interface and in the chemical space definition cell in Section 1.1.1), as well as the objective for `Yield` (maximize, minimize, or range).

EDBO+ web application: https://www.edbowebapp.com/

**Steps to follow:**

Open the EDBO+ web interface and select Create New Scope `+ Build`.

- Set the name of your scope to: `name`
- `Add Features`:
  - Temperature: # replace by your new suggestions, e.g. 0, 20, 40, 80, 100, 120, 140 
  - Concentration: # replace by your new suggestions; e.g. 0.1, 0.2, 1.0, 1.2, 1.5, 1.6, 2.0
    
Click `Create scope`

To run the optimization, go to `Optimize` (*learner*), select the previously defined scope `name`, and click `Setup Optimizer`.

- Select the features you want to use. Choose: `All`
- Add Objectives:
      `Yield`,
      `maximize` (*define the query strategy*)
- Number of experiments to run: `3` (*experiments to label*)
- Click `Run optimizer` (*run the learner model to get predictions*)
- Click `Download scope` (*this provides the queries to label*)

Label Experiments:

- In the downloaded scope file, label the queries with `PRIORITY = 1` and save it as name_request1 (name_request1.csv) (*label by oracle*)
- There is also a `Download predictions` button. In the first optimization there are no predictions yet, because the model first needs some labels.
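
The labeling step can also be done programmatically. The sketch below is only an illustration: the DataFrame is a hand-made stand-in for a downloaded scope file, the column names follow the description above, and the yields in `measured` are invented. Check the column names against your own downloaded file before adapting it.

```python
import pandas as pd

# Hypothetical excerpt of a downloaded scope file (not real EDBO+ output)
scope = pd.DataFrame({
    "Temperature":   [40, 80, 100, 0],
    "Concentration": [1.6, 0.2, 1.0, 2.0],
    "Yield":         ["PENDING", "PENDING", "PENDING", "PENDING"],
    "PRIORITY":      [1, 1, 0, -1],
})

# Invented yields for the suggested experiments, keyed by (Temperature, Concentration)
measured = {(40, 1.6): 71.5, (80, 0.2): 55.0}

# Fill in the label only for the rows the optimizer asked for (PRIORITY == 1)
for idx, row in scope[scope["PRIORITY"] == 1].iterrows():
    key = (int(row["Temperature"]), float(row["Concentration"]))
    scope.loc[idx, "Yield"] = measured[key]

print(scope)

# To save the labeled file for upload:
# scope.to_csv("name_request1.csv", index=False)
```

Rows that were not requested keep the `PENDING` placeholder, which the plotting functions in Section 1.1 treat as missing values.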

Iteration Process

Once the first iteration is complete, proceed with additional iterations:

- `Upload` the updated results file `name_request1` (name_request1.csv) as a new scope `name_1`
- Repeat the optimization steps outlined above
- From now on, use both `Download scope` and `Download predictions`, naming the files name_request2.csv (the data to label in the second iteration) and name_pred1.csv (the predictions made with the first set of labels), respectively.
- Continue iterating as needed

**NB!** The names `Temperature`, `Concentration` and `Yield` must start with a capital letter! \
**NB!** Ensure all .csv files (e.g., name_request1.csv, name_request2.csv, etc.) are saved in the same directory as this Jupyter notebook. \
**NB!** In Section 1.1.1 (Define and load the chemical space), update the chemical space parameters (if modified) and the file names to match your iteration numbers.

# Part 2: Active Learning using modAL library

To demonstrate the power of data-driven optimization, we apply AL to a chemical reaction dataset, using `surrogate models` to iteratively select the most informative experiments. This approach mirrors the logic of platforms like EDBO+ but is implemented here with customizable Python workflows.

The dataset contains two features, `Temperature` and `Concentration`, and one target property, `Yield`.

The dataset is available as a CSV file named `chemical_process_AL_TC.csv` in the GitHub repository linked at the top of this notebook.

**Methodology**
Two models are used to predict yields and guide AL: linear regression and a Gaussian Process (GP).

**Active learning workflow**

- Initialization: start with a small labeled subset (X_initial, y_initial).
- Query strategy:
  - For Linear Regression: select the points whose predictions deviate most from the mean prediction (squared deviation).
  - For GP: prioritize the points with the highest uncertainty (standard deviation).
- Iteration: repeatedly query the "oracle" (e.g., new experiments) and retrain.
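
The three stages can be sketched without any active learning library. The toy example below uses a straight-line fit from NumPy and synthetic data; all names (`X_all`, `y_all`, `labeled`, `pool`) are illustrative, and the query score is only one possible choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "experiments": the oracle knows y for every x
X_all = np.linspace(0.0, 10.0, 50)
y_all = 3.0 * X_all + 2.0 + rng.normal(scale=0.5, size=X_all.size)

# Initialization: a small labeled subset, everything else stays in the pool
labeled = list(rng.choice(X_all.size, size=3, replace=False))
pool = [i for i in range(X_all.size) if i not in labeled]

for iteration in range(5):
    # Train: fit a straight line to the labeled points
    slope, intercept = np.polyfit(X_all[labeled], y_all[labeled], deg=1)

    # Query: score each pool point; here, the squared deviation of its
    # prediction from the mean prediction over the pool
    y_pred = slope * X_all[pool] + intercept
    scores = (y_pred - y_pred.mean()) ** 2
    best = pool[int(np.argmax(scores))]

    # Oracle: reveal the label by moving the point from the pool to the labeled set
    labeled.append(best)
    pool.remove(best)
    print(f"iteration {iteration + 1}: queried x = {X_all[best]:.2f}, "
          f"labeled = {len(labeled)}, pool = {len(pool)}")
```

Part 2 follows the same loop, with modAL's `ActiveLearner` handling the train and query steps.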

**Visualization**
Track performance (R²) and training set growth across iterations to evaluate efficiency.

**Goal**: assess how model performance changes as the selected data points are added to the training set.
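
For reference, the R² score tracked in this part is the coefficient of determination, which is also what the `score` method of scikit-learn regressors returns. A minimal NumPy sketch of the formula, with made-up numbers, is:

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

y_true = [10.0, 20.0, 30.0, 40.0]
print(r2(y_true, y_true))                     # perfect predictions -> 1.0
print(r2(y_true, [25.0, 25.0, 25.0, 25.0]))   # always predicting the mean -> 0.0
```

A model that does worse than always predicting the mean gets a negative R², which can happen in early iterations when the training set is small.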

### 2.1. Load dataset

In [None]:
import warnings

# If you want to suppress all warnings (not recommended unless necessary)
warnings.filterwarnings("ignore") 

In [None]:
# Load the data (adjust the file name if your downloaded CSV is named differently)
data = pd.read_csv("chemical_AL_TC.csv")

# Display all columns in the dataframe
print(data.columns.tolist())

df_features = data.drop(columns=['Yield'])
X = df_features.values
y = data['Yield'].values

print('Track the size of the data set', 
      'X input:', X.shape, 
      'y target property:', y.shape)

In [None]:
# plot to visualize the evolution of the training 
def plot_active_learning_results(performance: list, set_sizes: list, model_name: str):
    """
    Plot the evolution of model performance and training set size during active learning.
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
    
    # Plot 1: Performance vs. Iterations
    ax1.plot(range(1, len(performance) + 1), performance, 'bo-', label=model_name)
    ax1.set_xlabel("Iteration")
    ax1.set_ylabel("Test Score (R²)")
    ax1.set_title(f"{model_name} Performance")
    ax1.grid(True, linestyle='--', alpha=0.7)
    ax1.legend()
    
    # Plot 2: Training Set Size Growth
    ax2.plot(range(1, len(set_sizes) + 1), set_sizes, 'ro-')
    ax2.set_xlabel("Iteration")
    ax2.set_ylabel("Training Set Size")
    ax2.set_title("Active Learning: Training Set Growth")
    ax2.grid(True, linestyle='--', alpha=0.7)
    
    plt.tight_layout()
    plt.show()

### 2.2. Select the parameters to initiate and query the model

In [None]:
# Number of data points to initialize the first model
n_initial = 50

# Number of query instances to retrieve during each Active Learning workflow iteration
n_instances_per_query = 5

# Total number of Active Learning cycles/iterations to perform
n_iterations = 35 

### 2.3. Split dataset into: initial, pool, and test sets

The data is divided into three distinct sets: 1) a training set to build the initial model, 2) a pool set from which data points can be selected for labeling, and 3) a test set to evaluate model performance.
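
The idea behind the three-way split can be illustrated on row indices alone. The sketch below uses hypothetical sizes and `demo_`-prefixed names so that it does not overwrite any notebook variable; the notebook's own split in the next cell uses `train_test_split` instead.

```python
import numpy as np

rng = np.random.default_rng(42)

demo_n_samples = 100   # hypothetical dataset size
demo_n_test = 20       # 20 % held out, as with test_size=0.2
demo_n_initial = 10    # hypothetical initial training size

# Shuffle all row indices once, then cut the shuffled array into three parts
demo_indices = rng.permutation(demo_n_samples)
demo_test_idx = demo_indices[:demo_n_test]
demo_initial_idx = demo_indices[demo_n_test:demo_n_test + demo_n_initial]
demo_pool_idx = demo_indices[demo_n_test + demo_n_initial:]

print(len(demo_initial_idx), len(demo_pool_idx), len(demo_test_idx))   # 10 70 20
```

Because the three index sets come from one permutation, they cannot overlap, so no test point can leak into the initial set or the pool.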

In [None]:
# Step 1: Set aside a test set
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: From rest, sample initial training data
initial_idx = np.random.choice(range(len(X_rest)), size=n_initial, replace=False)
X_initial = X_rest[initial_idx]
y_initial = y_rest[initial_idx]

# Step 3: Everything else becomes the pool
mask = np.ones(len(X_rest), dtype=bool)
mask[initial_idx] = False
X_pool = X_rest[mask]
y_pool = y_rest[mask]

print(len(X_pool), len(y_pool))

### 2.4. Linear Regression: building the initial model

In [None]:
# Train a linear regression model on the initial dataset
lin_reg = LinearRegression()
lin_reg.fit(X_initial, y_initial)

### 2.5. Linear Regression: query strategy and learner definition

In this cell, we define a custom query strategy for the active learning process using a linear regression model. Because the labels of the pool are unknown at query time, the `mse_query_strategy` function cannot compute true prediction errors; instead, it computes the squared deviation of each pool prediction from the mean prediction and returns the indices of the instances with the largest values, which are then queried.
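
The scoring step can be looked at in isolation. The sketch below reproduces it with plain NumPy on invented predictions; the helper name `top_n_by_deviation` is hypothetical, and it uses `np.argsort` rather than modAL's `multi_argmax`.

```python
import numpy as np

def top_n_by_deviation(y_pred, n_instances):
    """Return indices of the n predictions farthest from the mean prediction."""
    y_pred = np.asarray(y_pred, dtype=float)
    scores = (y_pred - y_pred.mean()) ** 2
    # argsort is ascending, so reverse it and keep the first n indices
    return np.argsort(scores)[::-1][:n_instances]

demo_predictions = [50.0, 52.0, 95.0, 48.0, 10.0]   # illustrative pool predictions
print(top_n_by_deviation(demo_predictions, n_instances=2))   # [2 4]
```

Note that this score favors points with extreme predicted values, both high and low. It does not measure model uncertainty, which is what the Gaussian Process strategy in Section 2.8 uses.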

In [None]:
def mse_query_strategy(classifier, X, n_instances=n_instances_per_query):
    
    y_pred = classifier.predict(X)
    # Squared deviation of each prediction from the mean prediction
    # (true errors cannot be computed here because the pool labels are unknown)
    individual_mse = (y_pred - np.mean(y_pred))**2
    query_idx, _ = multi_argmax(individual_mse.flatten(), n_instances=n_instances)  #  indices to query in the pool

    return query_idx

# Create an active learner for Linear Regression using the modAL library
lin_learner = ActiveLearner(estimator=lin_reg, 
                            query_strategy=mse_query_strategy, 
                            X_training=X_initial, 
                            y_training=y_initial)

### 2.6. Linear Regression: active learning workflow

In [None]:
print("Current X_training shape:", lin_learner.X_training.shape)
print("Current y_training shape:", lin_learner.y_training.shape)
print("X_pool shape:", X_pool.shape)
print("y_pool shape:", y_pool.shape)

In [None]:
# Perform active learning
lin_performance = []
training_set_sizes = []

for i in range(n_iterations):

    print(f'Iteration {i+1}')
    # Query from available points
    query_idx_lin, _ = lin_learner.query(X_pool, n_instances=n_instances_per_query)
    print('   Query length', len(query_idx_lin))
    print('   Indices to query from the pool', query_idx_lin)
    print()
     
    # Obtain the label from the oracle
    query_label_lin = y_pool[query_idx_lin].reshape(-1,)
    query_instance_lin = X_pool[query_idx_lin]
    
    # Teach augments the available training data with the new samples
    lin_learner.teach(X=query_instance_lin, y=query_label_lin)

    # Evaluate the model performance on the test set
    lin_score = lin_learner.score(X_test, y_test)

    # Print the results and append the scores to the performance lists

    print(f"   Linear Regression Test Score R²: {lin_score:.4f}")
    print(f"   Current training set size, X: {len(lin_learner.X_training)}, y: {len(lin_learner.y_training)}")
    
    # Remove the queried points from the pool so they cannot be selected again
    X_pool = np.delete(X_pool, query_idx_lin, axis=0)
    y_pool = np.delete(y_pool, query_idx_lin, axis=0)

    training_set_sizes.append(len(lin_learner.X_training))
    lin_performance.append(lin_score)


### 2.7. Linear Regression: evaluating the evolution of performance in the AL workflow

In [None]:
plot_active_learning_results(lin_performance, training_set_sizes, model_name = 'Linear Regression')

### 2.8. Gaussian Processes: query strategy and learner definition

In this cell, we define a custom query strategy for the AL workflow using Gaussian Process (GP) regression. The `GP_regression_std` function selects the pool instances with the largest predicted standard deviation, i.e. the points where the GP is least certain about its prediction.
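
To see where the standard deviation comes from, the sketch below computes the textbook GP posterior standard deviation by hand for a 1-D toy problem. It assumes a zero-mean GP with a unit-variance RBF kernel and fixed hyperparameters; scikit-learn's `GaussianProcessRegressor` additionally optimizes the kernel hyperparameters, so its numbers will differ. All names are illustrative.

```python
import numpy as np

def rbf(a, b, length_scale=1.0):
    """RBF kernel matrix between two sets of 1-D points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior_std(x_train, x_query, noise=1e-6):
    """Posterior standard deviation of a zero-mean GP with a unit-variance RBF kernel."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf(x_query, x_train)
    # var(x*) = k(x*, x*) - k*^T K^{-1} k*, with k(x*, x*) = 1 for this kernel
    solved = np.linalg.solve(K, K_s.T)            # shape (n_train, n_query)
    var = 1.0 - np.sum(K_s * solved.T, axis=1)
    return np.sqrt(np.clip(var, 0.0, None))

x_labeled = np.array([0.0, 1.0, 2.0])
x_candidates = np.array([0.5, 1.5, 5.0])

std = gp_posterior_std(x_labeled, x_candidates)
query = int(np.argmax(std))
print(std.round(3), "-> query candidate", x_candidates[query])
```

The candidate at 5.0 lies far from every labeled point, so its posterior standard deviation is close to the prior value of 1 and it is the one selected.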

In [None]:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import WhiteKernel, RBF

# setting the query strategy
def GP_regression_std(regressor, X_sample, n_instances=1):

    y_pred, std = regressor.predict(X_sample, return_std=True)
    query_idx, _ = multi_argmax(std.flatten(), n_instances=n_instances)     # Select instances with highest uncertainty

    return query_idx

kernel = RBF(length_scale=1.0, length_scale_bounds=(1e-2, 1e3)) \
         + WhiteKernel(noise_level=1, noise_level_bounds=(1e-10, 1e+1))

GP_learner = ActiveLearner(estimator=GaussianProcessRegressor(kernel=kernel),
                            query_strategy=GP_regression_std,
                            X_training=X_initial,
                            y_training=y_initial)

### 2.9. GP: active learning workflow

In [None]:
# Rebuild the pool, because the linear regression loop removed points from it
mask = np.ones(len(X_rest), dtype=bool)
mask[initial_idx] = False
X_pool = X_rest[mask]
y_pool = y_rest[mask]

print(len(X_pool), len(y_pool))

In [None]:
# Perform active learning using Gaussian Processes
GP_performance = []
training_set_sizes_GP = []

for i in range(n_iterations):

    print(f'Iteration {i+1}')
    # Query the most uncertain data points in the pool
    query_idx_GP, query_instance_GP = GP_learner.query(X_pool,  n_instances=n_instances_per_query)
    print('   Query length', len(query_idx_GP))
    print('   Indices to query from the pool', query_idx_GP)
    print()
    
    # Obtain the label from the oracle
    query_label_GP = y_pool[query_idx_GP].reshape(-1,)
    query_instance_GP = X_pool[query_idx_GP]
    
    # Teach augments the available training data with the new samples
    GP_learner.teach(X=query_instance_GP, y=query_label_GP)

    # Evaluate the model performance on the test set
    GP_score = GP_learner.score(X_test, y_test)

    print(f"   GP Regression Test Score R²: {GP_score:.4f}")
    print(f"   Current training set size, X: {len(GP_learner.X_training)}, y: {len(GP_learner.y_training)}")
    
    # Remove the queried points from the pool so they cannot be selected again
    X_pool = np.delete(X_pool, query_idx_GP, axis=0)
    y_pool = np.delete(y_pool, query_idx_GP, axis=0)
     
    GP_performance.append(GP_score)
    training_set_sizes_GP.append(len(GP_learner.X_training))
    

### 2.10. GP: evaluating the evolution of performance in the AL workflow

In [None]:
plot_active_learning_results(GP_performance, training_set_sizes_GP, model_name='Gaussian Processes')

### 2.11. Comparison Linear Regression vs. GPs

In [None]:
# Get test set predictions
y_pred_lin = lin_learner.predict(X_test)  # Linear model
y_pred_gp, _ = GP_learner.predict(X_test, return_std=True)  # GP model (mean prediction)

In [None]:
# Compare the final test scores of the linear and GP models
print("Comparison of the results:")
print(f"  Linear Regression Test Score (R²): {lin_score:.4f}")
print(f"  Gaussian Process Test Score (R²): {GP_score:.4f}")

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))

# Plot Linear Regression predictions
plt.subplot(1, 2, 1)
plt.scatter(y_test, y_pred_lin, alpha=0.5, c='orange', label='Linear Regression')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)  # Diagonal line
plt.xlabel('True Values')
plt.ylabel('Predicted Values')
plt.text(0.05, 0.9, f'R² = {lin_score:.3f}', transform=plt.gca().transAxes,
         bbox=dict(facecolor='white', alpha=0.8))
plt.title('Linear Regression: True vs Predicted')
plt.legend()

# Plot Gaussian Process predictions
plt.subplot(1, 2, 2)
plt.scatter(y_test, y_pred_gp, alpha=0.5, c='violet', label='Gaussian Process')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
plt.xlabel('True Values')
plt.ylabel('Predicted Values')
plt.text(0.05, 0.9, f'R² = {GP_score:.3f}', transform=plt.gca().transAxes,
         bbox=dict(facecolor='white', alpha=0.8))
plt.title('Gaussian Process: True vs Predicted')
plt.legend()

plt.tight_layout()
plt.show()

### 2.12. Next Steps: Experimenting with Active Learning Parameters

Now that you’ve observed the predictions from both models (Linear Regression and Gaussian Process) under the current active learning strategy, you can explore how adjusting parameters affects performance. Focus on the initialization and querying logic in **Sections 2.2 and 2.3** of the notebook, where you:

- Define the initial training set (X_initial, y_initial),
- Set the number of query points (e.g., n_instances_per_query),
- Define the pool (X_pool, y_pool).

Key Parameters to Tune

| Parameter | Description | Example values |
|---|---|---|
| **n_initial** | Size of the starting labeled set | 10, 20, 50 |
| **n_instances_per_query** | Number of points queried per iteration | 1, 5, 10 |

How to Proceed

- Modify parameters in Section 2.2, then re-run all subsequent cells (to avoid dataset/state corruption).
- Compare results using the `plot_active_learning_results()` function.
- Observe trade-offs:
  - Smaller n_initial → faster, but risks poor initial performance.
  - Larger n_instances_per_query → faster convergence, but may select redundant points.
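
Before changing these parameters, it helps to check how many labels a given setting will consume, since the loop cannot query more points than the pool contains. The helper `labeling_budget` below is a hypothetical convenience function, and the pool sizes are made-up placeholders; use `len(X_pool)` from Section 2.3 for your own check.

```python
def labeling_budget(n_initial, n_instances_per_query, n_iterations, n_pool):
    """Return (total labeled points at the end, whether the pool can supply all queries)."""
    queried = n_instances_per_query * n_iterations
    total_labeled = n_initial + queried
    return total_labeled, queried <= n_pool

# Values from Section 2.2; the pool size of 500 is a made-up placeholder
total, feasible = labeling_budget(50, 5, 35, n_pool=500)
print(total, feasible)   # 225 True

# Doubling the batch size with a smaller pool exhausts it before the loop ends
total, feasible = labeling_budget(50, 10, 35, n_pool=300)
print(total, feasible)   # 400 False
```

If the check returns `False`, reduce `n_iterations` or `n_instances_per_query` before re-running the workflow.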

**NB!** Always re-run from Section 2.2 after changes to ensure clean experiments.