# Identify Label Errors with Label Inspector from AWS Marketplace 


Cleanlab's [Label Inspector](https://aws.amazon.com/marketplace/pp/prodview-rlbhc2lxttdio) automatically detects label errors in your classification dataset. You just need a (tabular or text) dataset containing class labels and feature values for each datapoint, and this solution will flag examples that are likely mislabeled.

This sample notebook demonstrates how to use the Label Inspector via Amazon SageMaker.

## Pre-requisites
1. Ensure that the IAM role used has the **AmazonSageMakerFullAccess** policy attached.
1. Some hands-on experience using [Amazon SageMaker](https://aws.amazon.com/sagemaker/).
1. To use this algorithm successfully, ensure that:
    1. Either your IAM role has these three permissions and you have authority to make AWS Marketplace subscriptions in the AWS account used: 
        1. **aws-marketplace:ViewSubscriptions**
        1. **aws-marketplace:Unsubscribe**
        1. **aws-marketplace:Subscribe**  
    2. or your AWS account has a subscription to [Label Inspector](https://aws.amazon.com/marketplace/pp/prodview-rlbhc2lxttdio). 


## Usage instructions
You can run this notebook one cell at a time (press Shift+Enter to run a cell).

## 1. Subscribe to Label Inspector

To subscribe to the Label Inspector offering:
1. Open the AWS Marketplace listing page for [Label Inspector](https://aws.amazon.com/marketplace/pp/prodview-rlbhc2lxttdio).
1. On the listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review and click on **"Accept Offer"** if you agree with the EULA and pricing terms. 
1. Once you click the **Continue to configuration** button and choose a **region**, you will see a **Product Arn**. This is the algorithm ARN that you need to specify to use this algorithm. Copy the ARN corresponding to your region and specify it in the following cell.

> **Note**: This is a reference notebook and it cannot run unless you enter the algorithm ARN in the cell below.

In [1]:
algo_arn = "<Specify the ARN for Label Inspector obtained from AWS Marketplace>"
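Before launching a job, it can help to confirm that the placeholder above was actually replaced. The following is a minimal sketch, not part of Label Inspector or the SageMaker SDK: `looks_like_algorithm_arn` is a hypothetical helper, and the pattern assumes the general shape `arn:<partition>:sagemaker:<region>:<account-id>:algorithm/<name>`.

```python
import re

# Assumed general shape of a SageMaker algorithm ARN; adjust if yours differs.
_ALGORITHM_ARN_PATTERN = re.compile(
    r"^arn:aws[a-z-]*:sagemaker:[a-z0-9-]+:\d{12}:algorithm/[A-Za-z0-9-]+$"
)


def looks_like_algorithm_arn(arn: str) -> bool:
    """Hypothetical helper: True if `arn` resembles a SageMaker algorithm ARN."""
    return bool(_ALGORITHM_ARN_PATTERN.match(arn.strip()))
```

For example, calling `looks_like_algorithm_arn(algo_arn)` right after the cell above returns `False` while the placeholder text is still in place. The check only looks at the string's shape; it does not verify that your account is subscribed.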

## 2. Set up environment

In [2]:
import sagemaker as sage
from sagemaker import get_execution_role
import boto3
import os
import pandas as pd
import tarfile

In [3]:
sagemaker_session = sage.Session()
role = get_execution_role()

In [4]:
# define S3 locations for saving data, replace if you would like to store your data in alternative locations
bucket = sagemaker_session.default_bucket() 
base_folder_name = "label-inspector"  
input_folder = "{}/{}".format(base_folder_name, "input")
output_location = "s3://{}/{}/{}".format(bucket, base_folder_name, "output")

instance_type = "ml.m5.xlarge"  # replace with "ml.m5.24xlarge" if more memory is required (big dataset) and to get results much faster

## 3. Prepare dataset and Upload to Amazon S3 (skip if data is already in S3)

### View Sample Dataset 

Here is an example dataset that you can run Label Inspector on. Label Inspector will take approximately 5 minutes to train an ML model and identify label errors on this example dataset.

If using your own data, ensure that the first line of your CSV file is a header with the column names. The first column must contain the class labels; the remaining columns are treated as predictive features. Labels must be categorical: strings or discrete integers are fine, but continuous numbers are not, since only binary and multi-class classification datasets are supported.
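The formatting requirements above can be checked locally before uploading. This is a rough sketch using only the standard library; `check_label_inspector_csv` is a hypothetical helper written for this tutorial, and its checks are an interpretation of the requirements rather than the validation Label Inspector itself performs.

```python
import csv
import io
from typing import List


def _is_non_integer_number(value: str) -> bool:
    """True for numeric text such as '3.7' that is not a whole number."""
    try:
        number = float(value)
    except ValueError:
        return False  # non-numeric text is a valid categorical label
    return not number.is_integer()


def check_label_inspector_csv(csv_text: str) -> List[str]:
    """Hypothetical helper: list problems found in the CSV text (empty if none)."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if len(rows) < 2:
        return ["file needs a header line plus at least one data row"]
    header, data = rows[0], rows[1:]
    problems = []
    if any(len(row) != len(header) for row in data):
        problems.append("some rows do not have the same number of columns as the header")
    labels = [row[0] for row in data]
    if any(not label.strip() for label in labels):
        problems.append("some rows have an empty label in the first column")
    if any(_is_non_integer_number(label) for label in labels):
        problems.append("first column looks continuous; labels must be discrete classes")
    return problems
```

To check a file on disk, read it first, for example `check_label_inspector_csv(open("data/input/dataset.csv").read())`. An empty list means none of these particular checks failed.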

In [5]:
sample_input = pd.read_csv("data/input/dataset.csv")
sample_input.head(5)

Unnamed: 0,letter_grade,exam_1,exam_2,exam_3,notes
0,C,53,77,93,
1,B,81,64,80,great participation +10
2,B,74,88,97,
3,C,61,94,78,
4,C,48,90,91,


### Upload datasets to Amazon S3

In [6]:
training_dataset = "data/input/dataset.csv"  # replace filepath here if using your own dataset

In [7]:
# upload data to S3
training_data_location = sagemaker_session.upload_data(training_dataset, bucket=bucket, key_prefix=input_folder)

In [None]:
# you can find your data here after uploading
training_data_location

## 4. Train an ML model to analyze the labels in our dataset

After ensuring that our data is in an accessible Amazon S3 bucket (the algorithm only needs read permissions on the input data bucket), we are ready to train a machine learning model. This model can be automatically trained on diverse types of tabular/text data via state-of-the-art AutoML, and is used to identify which labels are likely incorrect.

In the code cell below we specify the S3 location of our data. If you have followed the [Upload datasets to Amazon S3](#Upload-datasets-to-Amazon-S3) section of this tutorial, the S3 location should already be specified. If you are using your own dataset, make sure to specify the location of your data below.

In [9]:
training_data = training_data_location  # replace with the S3 URI of your data if using your own dataset

Next, we specify the hyperparameters for our algorithm. Currently, our algorithm supports only one hyperparameter, `runtime`, with two options:

- `fast` will have a shorter execution time, but may not produce the best quality results (maximum execution time: 3 hours, will take less time for most datasets)
- `high_accuracy` will be slower, but produces high quality results (maximum execution time: 10 hours, will take less time for most datasets)

In either case, you can get results faster by specifying a more powerful `instance_type`. When estimating costs, keep in mind that a more powerful instance is billed at a higher rate but can finish the job sooner. Training time grows with the size of your dataset, so for large datasets leave the job running and check back later to see whether it has completed.

In [10]:
hyperparameters = {"runtime": "fast"}  # change to "high_accuracy" for better results (will generally take longer)
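Since `runtime` accepts only the two options described above, a typo in the value is easy to catch before starting a billable job. The sketch below is illustrative: `make_hyperparameters` and `MAX_RUNTIME_HOURS` are hypothetical names, and the hour values are simply the upper bounds quoted in the option descriptions.

```python
# Maximum execution times quoted in the option descriptions above, in hours.
MAX_RUNTIME_HOURS = {"fast": 3, "high_accuracy": 10}


def make_hyperparameters(runtime: str = "fast") -> dict:
    """Hypothetical helper: build the hyperparameters dict, rejecting unknown options."""
    if runtime not in MAX_RUNTIME_HOURS:
        raise ValueError(
            f"runtime must be one of {sorted(MAX_RUNTIME_HOURS)}, got {runtime!r}"
        )
    return {"runtime": runtime}
```

With this in place, `hyperparameters = make_hyperparameters("high_accuracy")` produces the same dictionary as writing it by hand, while a misspelled value such as `"high-accuracy"` raises a `ValueError` immediately.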

Then, we create an estimator object for running a training job and train our model. For information on creating an `Estimator` object, check out the [documentation](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html).

In [11]:
estimator = sage.algorithm.AlgorithmEstimator(
    algorithm_arn=algo_arn,
    base_job_name="label-inspector",
    role=role,
    instance_count=1,
    instance_type=instance_type,
    input_mode="File",
    output_path=output_location,
    sagemaker_session=sagemaker_session,
    hyperparameters=hyperparameters,
)

In [12]:
# run the training job
estimator.fit({"training": training_data})

INFO:sagemaker:Creating training-job with name: label-inspector-2023-03-14-19-11-44-226


2023-03-14 19:11:44 Starting - Starting the training job...
2023-03-14 19:12:01 Starting - Preparing the instances for training...
2023-03-14 19:12:43 Downloading - Downloading input data...
2023-03-14 19:13:03 Training - Downloading the training image............
2023-03-14 19:15:04 Training - Training image download completed. Training in progress..
Analyzing your data ... the estimated maximum runtime (upper bound) is: 2 hours

2023-03-14 19:15:45 Uploading - Uploading generated training model
Label errors found.

2023-03-14 19:15:55 Completed - Training job completed
Training seconds: 192
Billable seconds: 192


In [None]:
# your output will be available at the following path
# (built with an f-string because os.path.join is meant for local file paths, not S3 URIs)
output_file_location = f"{output_location}/{estimator._current_job_name}/output"
output_file_location

## 5. Fetch results of the label analysis

After the training job completes, we can get the results output by this analysis from the S3 bucket specified above. The main output of this solution is a CSV file with information about each label in your dataset. To learn more about what each column of the output file contains, check out our documentation [here](README.md#output). The examples flagged as likely mislabeled with the lowest label quality scores are the ones whose labels you should review closely.

In [None]:
s3 = boto3.client("s3")

# make sure the local download folder exists, then download the results archive from S3
os.makedirs("data/output", exist_ok=True)
with open("data/output/output.tar.gz", "wb") as f:
    s3.download_fileobj(bucket, f"{base_folder_name}/output/{estimator._current_job_name}/output/output.tar.gz", f)
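If you changed `output_location` to point at a different bucket or prefix, the bucket and key passed to `download_fileobj` must change with it. One way to keep them in sync is to derive both from the output URI. This is a minimal sketch: `split_s3_uri` and `output_archive_uri` are hypothetical helpers, and the `<job name>/output/output.tar.gz` layout is the one used in the download cell above.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str):
    """Hypothetical helper: split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def output_archive_uri(output_location: str, job_name: str) -> str:
    """Compose the results archive URI, following the layout used in this notebook."""
    return f"{output_location.rstrip('/')}/{job_name}/output/output.tar.gz"
```

For instance, `split_s3_uri(output_archive_uri(output_location, estimator._current_job_name))` returns a `(bucket, key)` pair that could be passed to `s3.download_fileobj` in place of the hand-built arguments.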

In [15]:
with tarfile.open('data/output/output.tar.gz') as file:
    file.extractall('data/output/')

In [16]:
pd.read_csv("data/output/cleanset.csv")

Unnamed: 0,is_label_issue,label_score,given_label,predicted_label
0,False,0.612850,C,C
1,False,0.657367,B,B
2,False,0.671640,B,B
3,False,0.599362,C,C
4,False,0.346935,C,C
...,...,...,...,...
936,False,0.510462,F,F
937,False,0.463695,F,F
938,False,0.686431,F,F
939,False,0.670583,B,B
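To focus the review on the most suspicious labels, the flagged rows can be sorted by `label_score`, lowest first. The sketch below uses only the standard library; `rows_to_review` is a hypothetical helper, the inline rows are made up for illustration (they are not results from the sample dataset), and it assumes `is_label_issue` is stored as the text `True`/`False` as in the output shown above.

```python
import csv
import io


def rows_to_review(csv_text: str, limit: int = 10):
    """Hypothetical helper: flagged rows sorted from lowest to highest label_score."""
    reader = csv.DictReader(io.StringIO(csv_text))
    flagged = [row for row in reader if row["is_label_issue"].strip().lower() == "true"]
    flagged.sort(key=lambda row: float(row["label_score"]))
    return flagged[:limit]


# Made-up rows for illustration only.
example = """is_label_issue,label_score,given_label,predicted_label
False,0.61,C,C
True,0.12,A,C
True,0.05,B,F
"""

for row in rows_to_review(example):
    print(row["given_label"], "->", row["predicted_label"], row["label_score"])
```

With pandas, the same selection can be written as `df[df["is_label_issue"]].sort_values("label_score")` on the DataFrame loaded from `cleanset.csv`.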


## Note about Real-time and Batch Transform

Real-time and batch inference currently have no effect for the Label Inspector, so do not use them. All results are returned at the end of the training job.

We have a beta version of the Label Inspector that supports real-time and batch inference, either to get predictions from a robust classifier model (trained with label issues fixed) or to evaluate the label quality of future test datapoints on the fly. To request access, email support@cleanlab.ai.

To run a more in-depth analysis of your data and labels, try [Cleanlab Studio](https://cleanlab.ai/studio/) for free (it supports image data as well).