# Hyperparameter Optimization Example

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cleanlab/examples/blob/master/hyperparameter_optimization/hyperparameter_optimization.ipynb) 

This example shows the two main hyper-parameters of `CleanLearning` and how to tune them with a grid search.

1. `filter_by` : str (default: `'prune_by_noise_rate'`). Method used to prune examples flagged as label issues. In this notebook it is passed to `CleanLearning` through `find_label_issues_kwargs`.
    * Values: `'prune_by_class'`, `'prune_by_noise_rate'`, or `'both'`.
    * `'prune_by_noise_rate'`: removes the examples with the *highest probability* of being mislabeled for every off-diagonal entry of the `prune_counts_matrix` (see `filter.py`).
    * `'prune_by_class'`: removes the examples with the *smallest probability* of belonging to their given class label, for every class.
    * `'both'`: removes only the examples that both `'prune_by_noise_rate'` and `'prune_by_class'` would remove (the intersection of the two sets).


2. `converge_latent_estimates` : bool (default: `False`)
    * If `True`, forces numerical consistency of the latent estimates. Each is estimated independently, but they are related mathematically by closed-form equivalences. This option iteratively enforces mathematical consistency among them.
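
To make the "closed-form equivalences" more concrete, here is a minimal sketch. It assumes the latent estimates in question are the class prior, the noise matrix, and the inverse noise matrix, and it links them with Bayes' rule. The matrix and prior values are hypothetical, and this is not cleanlab's internal code, only an illustration of the kind of relationship that independently estimated quantities may violate slightly.

```python
import numpy as np

# Hypothetical values for illustration only (not estimated by cleanlab).
# noise_matrix[s, y] = P(given_label = s | true_label = y); each column sums to 1.
noise_matrix = np.array([
    [0.8, 0.1, 0.2],
    [0.1, 0.7, 0.1],
    [0.1, 0.2, 0.7],
])
py = np.array([0.5, 0.3, 0.2])  # P(true_label = y)

# Joint distribution: joint[s, y] = P(given_label = s, true_label = y).
joint = noise_matrix * py
# Marginal over given labels: ps[s] = sum_y joint[s, y].
ps = joint.sum(axis=1)
# Bayes' rule: inverse_noise_matrix[y, s] = P(true_label = y | given_label = s).
inverse_noise_matrix = joint.T / ps

# Consistency checks that hold exactly when all quantities agree:
# each column of the inverse noise matrix sums to 1, and py is recovered from ps.
assert np.allclose(inverse_noise_matrix.sum(axis=0), 1.0)
assert np.allclose(inverse_noise_matrix @ ps, py)
```

When the prior and the two matrices are each estimated separately from noisy data, checks like these generally hold only approximately, which is the gap that `converge_latent_estimates=True` is meant to close.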

Please install the dependencies specified in this [requirements.txt](https://github.com/cleanlab/examples/blob/master/hyperparameter_optimization/requirements.txt) file before running the notebook.

In [1]:
from cleanlab.classification import CleanLearning
from cleanlab.benchmarking.noise_generation import generate_noise_matrix_from_trace
from cleanlab.benchmarking.noise_generation import generate_noisy_labels
from cleanlab.internal.util import print_noise_matrix
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
import numpy as np
import copy


In [2]:
def make_linear_dataset(n_classes=3, n_samples=300):
    X, y = make_classification(
        n_samples=n_samples,
        n_features=2,
        n_redundant=0,
        n_informative=2,
        random_state=1,
        n_clusters_per_class=1,
        n_classes=n_classes,
    )
    rng = np.random.RandomState(2)
    X += 2 * rng.uniform(size=X.shape)
    return (X, y)


In [3]:
# hyper-parameters
param_grid = {
    "find_label_issues_kwargs": [
        {"filter_by": "prune_by_noise_rate"},
        {"filter_by": "prune_by_class"},
        {"filter_by": "both"},
    ],
    "converge_latent_estimates": [True, False],
}
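
The grid above contains 3 × 2 = 6 candidate settings. As a rough sketch of what a grid search enumerates, the snippet below expands the same dictionary with `itertools.product`. It only illustrates the Cartesian product; it does not reproduce how `GridSearchCV` builds or schedules its candidates internally.

```python
from itertools import product

# Same grid as in the notebook cell above.
param_grid = {
    "find_label_issues_kwargs": [
        {"filter_by": "prune_by_noise_rate"},
        {"filter_by": "prune_by_class"},
        {"filter_by": "both"},
    ],
    "converge_latent_estimates": [True, False],
}

# Cartesian product of all candidate values, one dict per combination.
names = list(param_grid)
combinations = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

for combo in combinations:
    print(combo)
print(len(combinations), "combinations")
```

With `cv` left at its default (5-fold in recent scikit-learn versions), each of the 6 combinations is fitted 5 times before the best one is refit on the full training set, so the search cost grows quickly as values are added to the grid.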


In [4]:
# Set the sparsity of the noise matrix.
frac_zero_noise_rates = 0.0  # Consider increasing to 0.5
# A proxy for the fraction of labels that are correct.
avg_trace = 0.65  # ~35% wrong labels. Increasing makes the problem easier.
# Average number of examples per class; the dataset has num_classes * dataset_size examples.
dataset_size = 250  # Try 250 or 400 to use less or more data.
num_classes = 3

ds = make_linear_dataset(n_classes=num_classes, n_samples=num_classes * dataset_size)
X, y = ds
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
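
The comment on `avg_trace` calls it a proxy for the fraction of correct labels. A small sketch of why: the diagonal of the noise matrix holds the probability that each class keeps its true label, so the average of the diagonal equals the expected fraction of correct labels when classes are balanced. The matrix below is copied from the (rounded) output printed later in this notebook; the balanced class prior is an assumption made for illustration.

```python
import numpy as np

# Noise matrix as printed in this notebook's output (rounded to 2 decimals).
# Rows are given labels s, columns are true labels y.
noise_matrix = np.array([
    [0.52, 0.10, 0.34],
    [0.20, 0.82, 0.05],
    [0.28, 0.07, 0.61],
])
num_classes = noise_matrix.shape[0]

# Average of the diagonal: mean probability that a label is left unchanged.
avg_trace = np.trace(noise_matrix) / num_classes

# Assumption for illustration: balanced classes, so p(y=k) = 1/3 for every k.
py = np.full(num_classes, 1.0 / num_classes)
# Expected fraction of correct labels: sum_k p(s=k | y=k) * p(y=k).
frac_correct = float(np.diag(noise_matrix) @ py)

print(round(float(avg_trace), 2), round(frac_correct, 2))
```

With an unbalanced prior the two numbers differ, because classes with more examples weigh more heavily in the fraction of correct labels; that is why `avg_trace` is only a proxy.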


## Run hyper-parameter search with sklearn GridSearchCV

In [5]:
for name, clf in [
    (
        "Naive Bayes",
        GaussianNB(),
    ),
    (
        "Logistic Regression",
        LogisticRegression(random_state=0, solver="lbfgs", multi_class="auto"),
    ),
]:
    print("\n", "=" * len(name), "\n", name, "\n", "=" * len(name))
    np.random.seed(seed=0)
    clf_copy = copy.deepcopy(clf)
    # Compute p(y=k), the ground truth class prior on the labels.
    py = np.bincount(y_train) / float(len(y_train))
    # Generate the noisy channel to characterize the label errors.
    noise_matrix = generate_noise_matrix_from_trace(
        K=num_classes,
        trace=num_classes * avg_trace,
        py=py,
        frac_zero_noise_rates=frac_zero_noise_rates,
    )
    print_noise_matrix(noise_matrix)

    # Create the noisy labels. This method is exact w.r.t. the noise_matrix.
    y_train_with_errors = generate_noisy_labels(y_train, noise_matrix)

    # Run GridSearch with Cross-Validation
    lnl_cv = GridSearchCV(
        estimator=CleanLearning(clf, verbose=False),
        param_grid=param_grid,
    )
    lnl_cv.fit(X=X_train, y=y_train_with_errors)

    # Baseline: fit the plain classifier (without CleanLearning) on the noisy labels.
    # Its test accuracy is reported below as "default parameters".
    clf_copy.fit(X_train, y_train_with_errors)
    score_opt = lnl_cv.score(X_test, y_test)
    score_default = clf_copy.score(X_test, y_test)
    print("Accuracy with default parameters:", np.round(score_default, 2))
    print("Accuracy with optimized parameters:", np.round(score_opt, 2))
    print()
    s = "Optimal parameter settings using {}".format(name)
    print(s)
    print("-" * len(s))
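    # Note: get_params() lists the GridSearchCV configuration, not the selected
    # values; the winning combination is stored in lnl_cv.best_params_.
    for key in lnl_cv.get_params().keys():
        print(key, ":", lnl_cv.get_params()[key])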



 Naive Bayes 

 Noise Matrix (aka Noisy Channel) P(given_label|true_label) of shape (3, 3)
 p(s|y)	y=0	y=1	y=2
	---	---	---
s=0 |	0.52	0.1	0.34
s=1 |	0.2	0.82	0.05
s=2 |	0.28	0.07	0.61
	Trace(matrix) = 1.95

Accuracy with default parameters: 0.65
Accuracy with optimized parameters: 0.7

Optimal parameter settings using Naive Bayes
--------------------------------------------
cv : None
error_score : nan
estimator__clf__priors : None
estimator__clf__var_smoothing : 1e-09
estimator__clf : GaussianNB()
estimator__converge_latent_estimates : False
estimator__cv_n_folds : 5
estimator__find_label_issues_kwargs : {}
estimator__label_quality_scores_kwargs : {}
estimator__pulearning : None
estimator__seed : None
estimator__verbose : False
estimator : CleanLearning(clf=GaussianNB())
n_jobs : None
param_grid : {'find_label_issues_kwargs': [{'filter_by': 'prune_by_noise_rate'}, {'filter_by': 'prune_by_class'}, {'filter_by': 'both'}], 'converge_latent_estimates': [True, False]}
pre_dispatch : 2*n_j