#### Active Learning Algorithm
Active learning is a machine learning paradigm where the algorithm can query a user or an oracle to obtain labels for new data points. The goal is to achieve high performance with fewer labeled instances by strategically selecting which data points to label.

#### Use Cases for Active Learning
- Image Classification: In scenarios where labeling images is expensive or time-consuming, active learning can help focus on uncertain samples to improve the model's accuracy efficiently.

- Natural Language Processing: Active learning can be applied in tasks such as sentiment analysis or named entity recognition, where labeling text can be subjective and labor-intensive.

- Medical Diagnosis: In medical imaging, where experts are needed to label images, active learning can help prioritize which images to review.

- Anomaly Detection: Active learning can be used to identify and label anomalous data points that may be rare and critical for training robust models.

- Recommender Systems: In scenarios where user feedback is limited, active learning can be employed to gather relevant data points for better recommendations.

#### Generating Logical Data for Active Learning
We can generate synthetic logical data, such as a set of binary features with a binary target variable, to simulate a classification problem. Here’s how to create such data:

In [1]:
import numpy as np

# Generate random logical data
np.random.seed(42)
num_samples = 1000
num_features = 5

# Random binary features
X = np.random.randint(0, 2, size=(num_samples, num_features))

# Target variable (logical AND operation)
# For instance, the target could be 1 if all features are 1, otherwise 0
y = np.all(X, axis=1).astype(int)


#### Implementing Active Learning from Scratch Using NumPy
Below is a simple implementation of an active learning algorithm using NumPy. We'll use a basic model (e.g., logistic regression) and implement a pool-based active learning strategy:

In [8]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

class ActiveLearner:
    def __init__(self, X_pool, y_pool, initial_size=10, query_size=5):
        # Initialize the Active Learner with given parameters
        self.X_pool = X_pool  # Pool of unlabeled samples
        self.y_pool = y_pool  # Corresponding labels for the pool
        self.initial_size = initial_size  # Number of samples to initially label
        self.query_size = query_size  # Number of samples to query in each iteration
        self.model = LogisticRegression()  # Logistic Regression model for classification
        
        # Select the initial labeled dataset
        self.X_labeled = self.X_pool[:self.initial_size]
        self.y_labeled = self.y_pool[:self.initial_size]
        
        # Train the model on the initial labeled data
        self.model.fit(self.X_labeled, self.y_labeled)
        
        # Remove the initial labeled data from the pool
        self.X_pool = self.X_pool[self.initial_size:]
        self.y_pool = self.y_pool[self.initial_size:]

    def query(self):
        # Check if there are enough samples in the pool to query
        if len(self.X_pool) < self.query_size:
            print("Not enough samples in the pool to query.")
            return self.X_pool, self.y_pool, np.arange(len(self.X_pool))  # Return whatever is available

        # Get predicted probabilities for the current model on the unlabeled pool
        probs = self.model.predict_proba(self.X_pool)
        
        # Calculate uncertainty as 1 - max probability (higher values indicate more uncertainty)
        uncertainty = 1 - np.max(probs, axis=1)
        
        # Select the indices of the most uncertain samples
        query_indices = np.argsort(uncertainty)[-self.query_size:]  # Get indices of the top uncertain samples
        print(f"Query indices: {query_indices}, Uncertainty: {uncertainty[query_indices]}")  # Debugging info
        
        # Return the samples and their true labels
        return self.X_pool[query_indices], self.y_pool[query_indices], query_indices

    def update(self, X_new, y_new):
        # Ensure there are new samples to update
        if len(X_new) == 0 or len(y_new) == 0:
            print("No new samples to update.")
            return  # Early return if no new samples are available

        # Check the labels in new samples to ensure we are not just getting one class
        unique_labels = np.unique(y_new)
        if len(unique_labels) < 2:
            print("New samples must contain both classes for the model to learn.")
            return  # Exit if new samples do not provide both classes
        
        # Append the newly labeled samples to the existing labeled dataset
        self.X_labeled = np.vstack((self.X_labeled, X_new))
        self.y_labeled = np.concatenate((self.y_labeled, y_new))
        
        # Create a mask to identify new samples in the pool
        mask = np.isin(self.X_pool, X_new).any(axis=1)  # This checks if each row in X_pool matches any row in X_new
        self.X_pool = self.X_pool[~mask]  # Keep only samples not in X_new
        self.y_pool = self.y_pool[~mask]  # Keep only labels not in y_new
        
        # Print the updated sizes of the labeled dataset and the remaining pool
        print(f"Updated labeled set size: {len(self.X_labeled)}, Pool size: {len(self.X_pool)}")
        
        # Retrain the model with the updated labeled dataset
        self.model.fit(self.X_labeled, self.y_labeled)

    def get_performance(self, X_test, y_test):
        # Make predictions on the test set
        y_pred = self.model.predict(X_test)
        # Calculate and return the accuracy score
        return accuracy_score(y_test, y_pred)

# Simulate some random logical data for testing
num_features = 5  # Number of features for each sample
num_samples = 100  # Total number of samples
X = np.random.randint(0, 2, size=(num_samples, num_features))  # Random binary features
y = np.all(X, axis=1).astype(int)  # Labels: 1 if all features are 1, else 0

# Ensure we have at least one of each class
if np.unique(y).size < 2:
    raise ValueError("Generated labels do not contain both classes.")

# Initialize active learner
active_learner = ActiveLearner(X, y)

# Sample test set for evaluation (ensure it contains both classes)
X_test = np.random.randint(0, 2, size=(200, num_features))
y_test = np.all(X_test, axis=1).astype(int)  # Same labeling logic as above

# Ensure test set has both classes
if np.unique(y_test).size < 2:
    raise ValueError("Generated test labels do not contain both classes.")

# Simulate active learning process
for iteration in range(5):  # Run for 5 iterations
    X_query, y_query, query_indices = active_learner.query()  # Query the most uncertain samples
    active_learner.update(X_query, y_query)  # Update the model with new labels
    accuracy = active_learner.get_performance(X_test, y_test)  # Evaluate model performance
    print(f"Iteration {iteration + 1}, Accuracy: {accuracy:.4f}")  # Print the accuracy for this iteration


Query indices: [51 63 80 58 44], Uncertainty: [0.17057349 0.17198388 0.17198388 0.20477341 0.20477341]
Updated labeled set size: 15, Pool size: 0
Iteration 1, Accuracy: 0.9800
Not enough samples in the pool to query.
No new samples to update.
Iteration 2, Accuracy: 0.9800
Not enough samples in the pool to query.
No new samples to update.
Iteration 3, Accuracy: 0.9800
Not enough samples in the pool to query.
No new samples to update.
Iteration 4, Accuracy: 0.9800
Not enough samples in the pool to query.
No new samples to update.
Iteration 5, Accuracy: 0.9800


#### Explanation of Key Parts
- Imports:

    - The necessary libraries (sklearn for the logistic regression model and accuracy score, and numpy for numerical operations) are imported at the start.
- Initialization (__init__ method):

    - X_pool and y_pool: The pools of unlabeled data and their true labels are stored for querying.
    - initial_size and query_size: Define how many samples to use for initial training and how many to query in each iteration, respectively.
    - A logistic regression model is instantiated, and the initial labeled data is selected and fitted to the model. This is essential to get the first iteration started.
- Query Method:

    - The query method calculates the probabilities of the unlabeled data. The uncertainty is assessed by determining which predictions are the least confident (i.e., probabilities close to 0.5). This is crucial for active learning because the goal is to label the most informative data points.
- Update Method:

    - After querying, the update method adds the newly labeled data to the existing labeled dataset and retrains the model. This is important for improving the model's performance iteratively with new data.
- Performance Evaluation:

    - The get_performance method measures the accuracy of the model on a separate test set. Evaluating the model’s performance after each iteration helps track its improvement.
- Simulation Loop:

    - The loop simulates the active learning process for a specified number of iterations, printing the accuracy after each iteration to monitor progress. This provides a practical demonstration of how active learning refines the model over time.

#### When to Use Active Learning and When Not to Use It
- When to Use Active Learning:

    - When labeling data is expensive or time-consuming.
    - When you have a large pool of unlabeled data but only a limited budget for labeling.
    - When the model can benefit from focusing on uncertain samples.
    - When working with complex tasks where expert labeling is required.

- When Not to Use Active Learning:

    - When you have ample labeled data already available.
    - When the cost of labeling is low, and traditional supervised learning suffices.
    - When your task does not benefit significantly from uncertainty sampling.

##### What is the Loss Function
In active learning, the loss function typically depends on the model being used. For logistic regression, the loss function is usually the binary cross-entropy loss, defined as:

$Loss = -\frac{1}{n} \sum_{i=1}^n \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$

Where:

* $y_i$: True label
* $\hat{y}_i$: Predicted probability of the positive class
* $n$: Number of samples

#### How to Optimize the Algorithm
- Model Selection: Use more advanced models that can better capture complex relationships in the data.

- Smart Query Strategies: Experiment with different query strategies (e.g., uncertainty sampling, query by committee) to find the most effective one for your specific use case.

- Dynamic Query Size: Adjust the size of the query based on the confidence of the model or the remaining pool of unlabeled data.

- Budget Management: Monitor the labeling budget closely and prioritize data points that will give the most significant performance gains.

- Ensemble Methods: Use ensemble methods to improve the robustness of predictions and uncertainty estimates.

- Performance Monitoring: Continuously evaluate the model's performance on a validation set to ensure that active learning is improving the model effectively.

