Changed package name from confidentlearning to cleanlab
cgnorthcutt committed Nov 21, 2018
1 parent 0b14da7 commit 3693cc0
Showing 22 changed files with 337 additions and 2,770 deletions.
.gitignore (2 changes: 1 addition & 1 deletion)
@@ -1,5 +1,5 @@
# Added by Curtis G. Northcutt
-confidentlearning/*.ipynb
+cleanlab/*.ipynb
examples/models/*.ipynb

# Byte-compiled / optimized / DLL files
README.md (58 changes: 29 additions & 29 deletions)
@@ -1,4 +1,4 @@
-# ```confidentlearning```
+# ```cleanlab```
A Python package for Confident Learning with state-of-the-art algorithms for multiclass learning with noisy labels, detection of label errors in massive datasets, latent noisy channel estimation, latent prior estimation, and much more.

#### Confident learning theory and algorithms are:
@@ -13,18 +13,18 @@ Check out these [examples](examples).

Python 2.7 and Python 3.5 are supported.

-To install the `confidentlearning` package with pip, just run:
+To install the `cleanlab` package with pip, just run:

```
-$ pip install git+https://github.com/cgnorthcutt/confidentlearning.git
+$ pip install git+https://github.com/cgnorthcutt/cleanlab.git
```

If you have issues, you can also clone the repo and install:

```
$ conda update pip # if you use conda
-$ git clone https://github.com/cgnorthcutt/confidentlearning.git
-$ cd confidentlearning
+$ git clone https://github.com/cgnorthcutt/cleanlab.git
+$ cd cleanlab
$ pip install -e .
```
## Citations and Related Publications
@@ -57,20 +57,20 @@ These examples show how easy it is to characterize label noise in datasets, learn
<!---
## Automatically identify ~50 label errors in MNIST with confident learning. [[link]](examples/finding_MNIST_label_errors).
-![Image depicting label errors in MNIST train set.](https://raw.githubusercontent.com/cgnorthcutt/confidentlearning/master/img/mnist_training_label_errors24_prune_by_noise_rate.png)
+![Image depicting label errors in MNIST train set.](https://raw.githubusercontent.com/cgnorthcutt/cleanlab/master/img/mnist_training_label_errors24_prune_by_noise_rate.png)
Label errors of the original MNIST **train** dataset identified algorithmically using the rankpruning algorithm. Depicts the 24 least confident labels, ordered left-right, top-down by increasing self-confidence (probability of belonging to the given label), denoted conf in teal. The label with the largest predicted probability is in green. Overt errors are in red.
-![Image depicting label errors in MNIST test set.](https://raw.githubusercontent.com/cgnorthcutt/confidentlearning/master/img/mnist_test_label_errors8.png)
+![Image depicting label errors in MNIST test set.](https://raw.githubusercontent.com/cgnorthcutt/cleanlab/master/img/mnist_test_label_errors8.png)
Selected label errors in the MNIST **test** dataset ordered by increasing self-confidence (in teal).
## Automatically identify ~5k (of 50k) validation set label errors in ImageNet. [[link]](examples/finding_ImageNet_label_errors).
-![Image depicting label errors in ImageNet validation set.](https://raw.githubusercontent.com/cgnorthcutt/confidentlearning/master/img/imagenet_validation_label_errors_96_prune_by_noise_rate.jpg)
+![Image depicting label errors in ImageNet validation set.](https://raw.githubusercontent.com/cgnorthcutt/cleanlab/master/img/imagenet_validation_label_errors_96_prune_by_noise_rate.jpg)
Label errors in the 2012 ImageNet validation dataset identified automatically with confident learning using a pre-trained resnet18. Displayed are the 96 least confident labels. We see that ImageNet contains numerous multi-label images, although it is used widely by the machine learning and vision communities as a single-label benchmark dataset.
--->

-## Use `confidentlearning` with any model (Tensorflow, caffe2, PyTorch, etc.)
-All of the features of the `confidentlearning` package work with **any model**. Yes, any model. Feel free to use PyTorch, Tensorflow, caffe2, scikit-learn, mxnet, etc. If you use a scikit-learn classifier, all `confidentlearning` methods work out of the box. It's also easy to use your favorite model from a non-scikit-learn package: just wrap it in a Python class that inherits from `sklearn.base.BaseEstimator`:
+## Use `cleanlab` with any model (Tensorflow, caffe2, PyTorch, etc.)
+All of the features of the `cleanlab` package work with **any model**. Yes, any model. Feel free to use PyTorch, Tensorflow, caffe2, scikit-learn, mxnet, etc. If you use a scikit-learn classifier, all `cleanlab` methods work out of the box. It's also easy to use your favorite model from a non-scikit-learn package: just wrap it in a Python class that inherits from `sklearn.base.BaseEstimator`:
```python
from sklearn.base import BaseEstimator
class YourFavoriteModel(BaseEstimator): # Inherits sklearn base classifier
@@ -85,14 +85,14 @@ class YourFavoriteModel(BaseEstimator): # Inherits sklearn base classifier
def score(self, X, y, sample_weight = None):
pass

-# Now you can use your model with `confidentlearning`. Here's one example:
-from confidentlearning.classification import RankPruning
+# Now you can use your model with `cleanlab`. Here's one example:
+from cleanlab.classification import RankPruning
rp = RankPruning(clf=YourFavoriteModel())
rp.fit(train_data, train_labels_with_errors)
```
-### Want to see a working example? [Here's a compliant PyTorch MNIST CNN class](https://github.com/cgnorthcutt/confidentlearning/blob/master/examples/models/mnist_pytorch.py#L28)
+### Want to see a working example? [Here's a compliant PyTorch MNIST CNN class](https://github.com/cgnorthcutt/cleanlab/blob/master/examples/models/mnist_pytorch.py#L28)

-As you can see [here](https://github.com/cgnorthcutt/confidentlearning/blob/master/examples/models/mnist_pytorch.py#L28), technically you don't need to inherit from `sklearn.base.BaseEstimator`: a class that defines `.fit()`, `.predict()`, and `.predict_proba()` is enough. Inheriting, however, makes downstream scikit-learn applications like hyper-parameter optimization work seamlessly. For example, the [RankPruning() model](https://github.com/cgnorthcutt/confidentlearning/blob/master/confidentlearning/classification.py#L48) is fully compliant.
+As you can see [here](https://github.com/cgnorthcutt/cleanlab/blob/master/examples/models/mnist_pytorch.py#L28), technically you don't need to inherit from `sklearn.base.BaseEstimator`: a class that defines `.fit()`, `.predict()`, and `.predict_proba()` is enough. Inheriting, however, makes downstream scikit-learn applications like hyper-parameter optimization work seamlessly. For example, the [RankPruning() model](https://github.com/cgnorthcutt/cleanlab/blob/master/cleanlab/classification.py#L48) is fully compliant.
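For illustration, here is a minimal duck-typed sketch of that idea; the class, its names, and its deliberately trivial logic are hypothetical, not part of the package:

```python
import numpy as np

# A hypothetical model that defines only the three methods cleanlab calls.
# It ignores the features entirely and always predicts the training class
# frequencies (assumes integer labels 0, ..., m-1).
class PriorOnlyModel(object):
    def fit(self, X, y, sample_weight=None):
        self.n_classes_ = int(np.max(y)) + 1
        counts = np.bincount(y, minlength=self.n_classes_)
        self.prior_ = counts / float(len(y))  # empirical class frequencies
        return self

    def predict_proba(self, X):
        # The same probability vector for every example.
        return np.tile(self.prior_, (len(X), 1))

    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=1)
```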


## Documentation by Example - Quick Tutorials
@@ -102,7 +102,7 @@ Many of these methods have default parameters that won't be covered here. Check
### Multiclass learning with noisy labels (in **3** lines of code):
`RankPruning` is a fast, general, robust algorithm for multiclass learning with noisy labels. It adds minimal overhead, needing only *O(nm<sup>2</sup>)* time for n training examples and m classes, works with any classifier, and is easy to use.
```python
-from confidentlearning.classification import RankPruning
+from cleanlab.classification import RankPruning
# RankPruning uses logreg by default, so this is unnecessary.
# We include it here for clarity, but this step is omitted below.
from sklearn.linear_model import LogisticRegression as logreg
@@ -122,14 +122,14 @@ predicted_test_labels = rp.predict(X_test)
```
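The collapsed region above contains the construction and fit steps; pieced together from the surrounding examples, the pattern is:

```python
# Reconstructed from the surrounding examples (the diff collapses these lines).
rp = RankPruning(clf=logreg())            # clf defaults to logistic regression
rp.fit(X_train, train_labels_with_errors)
predicted_test_labels = rp.predict(X_test)
```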

### Estimate the confident joint, the latent noisy channel matrix *P<sub>s | y</sub>* and its inverse *P<sub>y | s</sub>*, the latent prior of the unobserved, actual true labels *p(y)*, and the predicted probabilities:
-where *s* denotes a random variable representing the observed, noisy label and *y* denotes a random variable representing the hidden, actual label. Both *s* and *y* take any of the m classes as values. The `confidentlearning` package supports different levels of computational granularity depending on your needs: each alternative below estimates these latent distribution arrays in no more than a few lines, so you can reduce computation time by computing only what you need.
+where *s* denotes a random variable representing the observed, noisy label and *y* denotes a random variable representing the hidden, actual label. Both *s* and *y* take any of the m classes as values. The `cleanlab` package supports different levels of computational granularity depending on your needs: each alternative below estimates these latent distribution arrays in no more than a few lines, so you can reduce computation time by computing only what you need.

Throughout these examples, you'll see a variable called *confident_joint*. The confident joint is an m x m matrix (m is the number of classes) that counts, for every observed, noisy class, the number of examples that confidently belong to every latent, hidden class. In other words, it counts the examples we are confident are labeled correctly or incorrectly for every pair of observed and unobserved classes. The confident joint is an unnormalized estimate of the complete-information latent joint distribution, *P<sub>s,y</sub>*. Most of the methods in the `cleanlab` package start by first estimating the *confident_joint*.
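To make the counting concrete, here is an illustrative numpy sketch (one way to formalize "confidently belongs"; the package's estimators handle calibration and edge cases beyond this). It assumes `psx` is an n x m array of held-out predicted probabilities and `s` is an integer array of n noisy labels:

```python
import numpy as np

def confident_joint_sketch(s, psx):
    n, m = psx.shape
    # Threshold for class j: the mean self-confidence of examples labeled j.
    thresholds = np.array([psx[s == j, j].mean() for j in range(m)])
    cj = np.zeros((m, m), dtype=int)
    for i in range(n):
        # Latent classes this example confidently belongs to.
        above = np.where(psx[i] >= thresholds)[0]
        if len(above) > 0:
            # Break ties with the largest predicted probability.
            j = above[np.argmax(psx[i, above])]
            cj[s[i], j] += 1  # row: observed noisy class; column: latent class
    return cj
```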

#### Option 1: Compute the confident joint and predicted probs first. Stop if that's all you need.
```python
-from confidentlearning.latent_estimation import estimate_latent
-from confidentlearning.latent_estimation import estimate_confident_joint_and_cv_pred_proba
+from cleanlab.latent_estimation import estimate_latent
+from cleanlab.latent_estimation import estimate_confident_joint_and_cv_pred_proba

# Compute the confident joint and the n x m predicted probabilities matrix (psx),
# for n examples, m classes. Stop here if all you need is the confident joint.
@@ -145,7 +145,7 @@ est_py, est_nm, est_inv = estimate_latent(confident_joint, s=train_labels_with_errors)
```

#### Option 2: Estimate the latent distribution matrices in a single line of code.
```python
-from confidentlearning.latent_estimation import estimate_py_noise_matrices_and_cv_pred_proba
+from cleanlab.latent_estimation import estimate_py_noise_matrices_and_cv_pred_proba
est_py, est_nm, est_inv, confident_joint, psx = estimate_py_noise_matrices_and_cv_pred_proba(
X=X_train,
s=train_labels_with_errors,
@@ -156,8 +156,8 @@ est_py, est_nm, est_inv, confident_joint, psx = estimate_py_noise_matrices_and_cv_pred_proba(
```
```python
# Already have psx? (n x m matrix of predicted probabilities)
# For example, you might get them from a pre-trained model (like resnet on ImageNet)
-# With the confidentlearning package, you estimate directly with psx.
-from confidentlearning.latent_estimation import estimate_py_and_noise_matrices_from_probabilities
+# With the cleanlab package, you estimate directly with psx.
+from cleanlab.latent_estimation import estimate_py_and_noise_matrices_from_probabilities
est_py, est_nm, est_inv, confident_joint = estimate_py_and_noise_matrices_from_probabilities(
s=train_labels_with_errors,
psx=psx,
@@ -166,10 +166,10 @@ est_py, est_nm, est_inv, confident_joint = estimate_py_and_noise_matrices_from_probabilities(
```

### Estimate label errors in a dataset:
-With the `confidentlearning` package, we can instantly fetch the indices of all estimated label errors, with nothing provided by the user except a classifier, examples, and their noisy labels. As in the previous example, there are various levels of granularity.
+With the `cleanlab` package, we can instantly fetch the indices of all estimated label errors, with nothing provided by the user except a classifier, examples, and their noisy labels. As in the previous example, there are various levels of granularity.

```python
-from confidentlearning.pruning import get_noise_indices
+from cleanlab.pruning import get_noise_indices
# We computed psx, est_inv, confident_joint in the previous example.
label_errors = get_noise_indices(
s=train_labels_with_errors, # required
```
@@ -187,7 +187,7 @@ There are two methods to compute *P<sub>s,y</sub>*, the complete-information distribution
This method is used when the hyperparameter prune_count_method = 'inverse_nm_dot_s' is passed to RankPruning.fit() and get_noise_indices().

```python
-from confidentlearning.util import value_counts
+from cleanlab.util import value_counts
# *p(s)*, the prior of the observed, noisy labels, is an array of length m (# of classes)
ps = value_counts(s) / float(len(s))
# We computed est_inv (estimated inverse noise matrix) in the previous example (two above).
@@ -198,7 +198,7 @@ psy = np.transpose(est_inv * ps) # Matrix of prob(s=l and y=k)
```
#### Method 2: Simplest. Compute by re-normalizing the confident joint. Rows won't sum to *p(s)*
This method is used when the hyperparameter prune_count_method = 'calibrate_confident_joint' is passed to RankPruning.fit() and get_noise_indices().
```python
-from confidentlearning.util import value_counts
+from cleanlab.util import value_counts
# *p(s)*, the prior of the observed, noisy labels, is an array of length m (# of classes)
ps = value_counts(s) / float(len(s))
# We computed confident_joint in the previous example (two above).
@@ -209,7 +209,7 @@ psy = confident_joint / float(confident_joint.sum()) # calibration, i.e. re-normalization
```
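To make the re-normalization concrete, here is a toy 2-class example with made-up counts:

```python
import numpy as np

# Toy confident joint (made-up counts, for illustration only).
confident_joint = np.array([[55,  5],
                            [10, 30]])
# Method 2: divide by the total count so all entries sum to 1.
psy = confident_joint / float(confident_joint.sum())
# psy == [[0.55, 0.05],
#         [0.10, 0.30]]
# The row sums (0.60, 0.40) need not equal the observed label priors p(s),
# which is exactly the calibration difference between Method 1 and Method 2.
```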

```python
# Generate a valid (necessary conditions for learnability are met) noise matrix for any trace > 1
-from confidentlearning.noise_generation import generate_noise_matrix_from_trace
+from cleanlab.noise_generation import generate_noise_matrix_from_trace
noise_matrix = generate_noise_matrix_from_trace(
K = number_of_classes,
trace = float_value_greater_than_1_and_leq_K,
@@ -218,7 +218,7 @@ noise_matrix = generate_noise_matrix_from_trace(
)

# Check if a noise matrix is valid (necessary conditions for learnability are met)
-from confidentlearning.noise_generation import noise_matrix_is_valid
+from cleanlab.noise_generation import noise_matrix_is_valid
is_valid = noise_matrix_is_valid(noise_matrix, prior_of_y_which_is_just_an_array_of_length_K)

```
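For example, a hypothetical call with concrete values; the `py` keyword for the prior and the defaults for the parameters elided above are assumptions:

```python
import numpy as np

py = np.array([0.3, 0.3, 0.4])  # assumed prior over the true labels
# trace=2.1 over K=3 classes means labels are correct ~70% of the time on average.
noise_matrix = generate_noise_matrix_from_trace(K=3, trace=2.1, py=py)
is_valid = noise_matrix_is_valid(noise_matrix, py)
```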
@@ -227,7 +227,7 @@ is_valid = noise_matrix_is_valid(noise_matrix, prior_of_y_which_is_just_an_array_of_length_K)

```python
# Generate noisy labels using the noise_marix. Guarantees exact amount of noise in labels.
-from confidentlearning.noise_generation import generate_noisy_labels
+from cleanlab.noise_generation import generate_noisy_labels
s_noisy_labels = generate_noisy_labels(y_hidden_actual_labels, noise_matrix)

# This package is full of other useful methods for learning with noisy labels.
@@ -240,4 +240,4 @@ s_noisy_labels = generate_noisy_labels(y_hidden_actual_labels, noise_matrix)
```
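Since the amount of noise is guaranteed to be exact, a quick sanity check (using the hypothetical variable names from the snippet above) is to compare the realized flip rate against the noise matrix:

```python
import numpy as np

# Fraction of labels that were actually flipped.
observed_noise = np.mean(s_noisy_labels != y_hidden_actual_labels)
# Expected fraction: 1 - sum_k P(s=k | y=k) * p(y=k).
py = np.bincount(y_hidden_actual_labels) / float(len(y_hidden_actual_labels))
expected_noise = 1.0 - np.dot(np.diag(noise_matrix), py)
print(observed_noise, expected_noise)  # these should closely agree
```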
### The key to learning in the presence of label errors is estimating the joint distribution between the actual, hidden labels '*y*' and the observed, noisy labels '*s*'. Using confident learning, we can completely characterize the trace of the latent joint distribution, *trace(P<sub>s,y</sub>)*, given *p(y)*, for any fraction of label errors, i.e. for any trace of the noisy channel, *trace(P<sub>s|y</sub>)*.
You can check out how to do this yourself here:
1. [Drawing Polyplices](examples/drawing_polyplices.ipynb)
-2. [Computing Polyplices](confidentlearning/polyplex.ipynb)
+2. [Computing Polyplices](cleanlab/polyplex.ipynb)
Empty file removed: __init__.py
Empty file removed: confidentlearning/__init__.py
