Because this is a regression problem, any regression model can serve as the base estimator. Among those tried, RandomForestRegressor seems to give the best results.
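
One way to check a claim like this is to cross-validate a few candidate regressors on the same data. The sketch below is only illustrative: the candidate list, the 5-fold cross-validation and the R2 scoring are assumptions, not the setup used to produce the results in this notebook.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# hypothetical candidate list; any scikit-learn regressor can be added here
candidates = {
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=42),
    "GradientBoosting": GradientBoostingRegressor(random_state=42),
    "Ridge": Ridge(alpha=1.0),
}

# mean R2 over 5 folds for each candidate
cv_scores = {
    name: cross_val_score(est, X, y, cv=5, scoring="r2").mean()
    for name, est in candidates.items()
}

# print the candidates from best to worst mean R2
for name, score in sorted(cv_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: mean R2 = {score:.3f}")
```

Whichever estimator scores best can then be passed as `estimator` to the `ActiveLearner` below.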

In [None]:
!pip install git+https://github.com/modAL-python/modAL.git


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/modAL-python/modAL.git
  Cloning https://github.com/modAL-python/modAL.git to /tmp/pip-req-build-yjv0w7v7
  Running command git clone --filter=blob:none --quiet https://github.com/modAL-python/modAL.git /tmp/pip-req-build-yjv0w7v7
  Resolved https://github.com/modAL-python/modAL.git to commit bba6f6fd00dbb862b1e09259b78caf6cffa2e755
  Preparing metadata (setup.py) ... done
Collecting skorch==0.9.0 (from modAL-python==0.4.2)
  Downloading skorch-0.9.0-py3-none-any.whl (125 kB)
Building wheels for collected packages: modAL-python
  Building wheel for modAL-python (setup.py) ... done
  Created wheel for modAL-python: filename=modAL_python-0.4.2-py3-none-any.whl size=32654 sha256=ab8253fd9ef4dff94ee6f0f58bf92a009a86a

In [None]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling
import numpy as np


diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

#split the data into labeled and unlabeled sets (50/50 split as requested)
X_labeled, X_unlabeled, y_labeled, y_unlabeled = train_test_split(X, y, test_size=0.5, random_state=42)

#active learning parameters
N_QUERIES = 20

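#base estimator & query strategy
model = RandomForestRegressor(random_state=42)
#uncertainty_sampling relies on predict_proba, so it is never invoked for this regressor;
#the loop below selects the query points manually instead
query_strategy = uncertainty_sampling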

#active learner
learner = ActiveLearner(
    estimator=model,
    X_training=X_labeled, y_training=y_labeled,
    query_strategy=query_strategy
)


for i in range(N_QUERIES):

    learner.fit(X_labeled, y_labeled)
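    #per-sample uncertainty: standard deviation of the individual trees' predictions
    #(np.std over the whole prediction vector would return a single scalar)
    tree_preds = np.stack([tree.predict(X_unlabeled) for tree in learner.estimator.estimators_])
    uncertainty_scores = tree_preds.std(axis=0)

    #query the most uncertain instance
    query_idx = np.argmax(uncertainty_scores)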

    #add the selected data point to the labeled set and remove it from the unlabeled set
    X_labeled = np.concatenate([X_labeled, X_unlabeled[query_idx].reshape(1, -1)])
    y_labeled = np.concatenate([y_labeled, y_unlabeled[query_idx].reshape(1, )])
    X_unlabeled = np.delete(X_unlabeled, query_idx, axis=0)
    y_unlabeled = np.delete(y_unlabeled, query_idx)

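    #evaluation on the remaining unlabeled pool, which serves as the test set here
    #(the learner is refitted with the newly added point at the start of the next iteration)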
    y_pred_test = learner.predict(X_unlabeled)
    mse = mean_squared_error(y_unlabeled, y_pred_test)
    r2 = learner.score(X_unlabeled, y_unlabeled)
    print("Query", i + 1, "- R2:", r2)
    print("Query", i + 1, "- Mean Squared Error on the test set:", mse)
    print('\n')


print('Without Active Learning:')

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

#split the data into training and test sets (50/50 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

y_pred_test = model.predict(X_test)

#evaluation
mse = mean_squared_error(y_test, y_pred_test)
#score the baseline model itself, not the active learner
r2 = model.score(X_test, y_test)

print("Mean Squared Error on the test set:", mse)
print("R2:", r2)

Query 1 - R2: 0.40951033496550693
Query 1 - Mean Squared Error on the test set: 3364.265577727273


Query 2 - R2: 0.41818404952514165
Query 2 - Mean Squared Error on the test set: 3312.1580968036524


Query 3 - R2: 0.4255138586445022
Query 3 - Mean Squared Error on the test set: 3278.841186697248


Query 4 - R2: 0.43061498176778756
Query 4 - Mean Squared Error on the test set: 3248.612370506912


Query 5 - R2: 0.42519752709728764
Query 5 - Mean Squared Error on the test set: 3290.316887037037


Query 6 - R2: 0.4220638083893722
Query 6 - Mean Squared Error on the test set: 3311.265136744186


Query 7 - R2: 0.4234099682493019
Query 7 - Mean Squared Error on the test set: 3297.070135514018


Query 8 - R2: 0.4200786795219793
Query 8 - Mean Squared Error on the test set: 3292.0304596244127


Query 9 - R2: 0.4156665193488416
Query 9 - Mean Squared Error on the test set: 3323.7171679245284


Query 10 - R2: 0.42579692344880593
Query 10 - Mean Squared Error on the test set: 3273.2133682464455



In the output above, the R2 score rises from 0.410 at query 1 to 0.431 at query 4, then fluctuates between roughly 0.416 and 0.426 through query 10. The mean squared error follows the same pattern: it drops from about 3364 to 3249 by query 4 and ends near 3273 at query 10. This suggests a modest gain in predictive performance from the first few queries. Increasing the number of query steps further does not considerably improve either metric.
Experiments with other regression models show a similar improvement in both metrics.
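
For ensembles such as a random forest, one common way to obtain a per-sample uncertainty score is the spread of the individual trees' predictions. The following is a minimal sketch under that assumption; the helper name `forest_std` and the variable names are hypothetical, and this is not modAL's built-in query strategy.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def forest_std(forest, X_pool):
    """Per-sample standard deviation of the individual trees' predictions."""
    # one row of predictions per tree: shape (n_trees, n_pool_samples)
    tree_preds = np.stack([tree.predict(X_pool) for tree in forest.estimators_])
    return tree_preds.std(axis=0)  # shape: (n_pool_samples,)


X, y = load_diabetes(return_X_y=True)
X_lab, X_pool, y_lab, y_pool = train_test_split(X, y, test_size=0.5, random_state=42)

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_lab, y_lab)

scores = forest_std(forest, X_pool)
query_idx = int(np.argmax(scores))  # index of the most uncertain pool sample
print("pool size:", len(X_pool), "- queried index:", query_idx)
```

Taking the standard deviation along the tree axis keeps one score per pool sample, so `np.argmax` can actually distinguish between candidates. Models without sub-estimators would need a different uncertainty estimate.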