# Identify Label Errors with Label Inspector from AWS Marketplace 


Cleanlab's [Label Inspector](https://aws.amazon.com/marketplace/pp/prodview-rlbhc2lxttdio) automatically detects label errors in your classification dataset. You just need a (tabular or text) dataset containing class labels and feature values for each datapoint, and this solution will flag examples that are likely mislabeled.

This sample notebook demonstrates how to use Label Inspector via Amazon SageMaker. You can run it either locally on your computer or from within SageMaker (recommended).

View our handy [AWS Marketplace Guide](../GUIDE.md) if you get stuck anywhere, especially with providing credentials/ARNs or other setup steps.

## Pre-requisites
1. Ensure that the IAM role used has **AmazonSageMakerFullAccess**.
1. Some hands-on experience using [Amazon SageMaker](https://aws.amazon.com/sagemaker/).
1. To use this algorithm successfully, ensure that:
    1. Either your IAM role has these three permissions and you have authority to make AWS Marketplace subscriptions in the AWS account used: 
        1. **aws-marketplace:ViewSubscriptions**
        1. **aws-marketplace:Unsubscribe**
        1. **aws-marketplace:Subscribe**  
    2. or your AWS account has a subscription to [Label Inspector](https://aws.amazon.com/marketplace/pp/prodview-rlbhc2lxttdio). 

## Usage instructions
You can run this notebook one cell at a time (press Shift+Enter to run a cell).

## 1. Subscribe to Label Inspector

To subscribe to the Label Inspector offering:
1. Open the AWS Marketplace listing page for [Label Inspector](https://aws.amazon.com/marketplace/pp/prodview-rlbhc2lxttdio).
1. On the listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the EULA and pricing terms, and click **"Accept Offer"** if you agree with them.
1. Once you click the **Continue to configuration** button and then choose a **region**, you will see a **Product Arn**. This is the algorithm ARN that you need to specify to use this algorithm. Copy the ARN corresponding to your region and specify it in the following cell.

> **Note**: This is a reference notebook and it cannot run unless you enter the algorithm ARN in the cell below.

In [1]:
algo_arn = "<Specify the ARN for Label Inspector obtained from AWS Marketplace>"
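
Before moving on, it can help to check that the value pasted into `algo_arn` has the shape of a SageMaker algorithm ARN and to read off its region, since the region must match the one used by your session. The sketch below is only a sanity check under that assumption: `parse_algorithm_arn` is a hypothetical helper (not part of the SageMaker SDK), and the example ARN is a made-up placeholder.

```python
import re

# General shape: arn:<partition>:sagemaker:<region>:<account-id>:algorithm/<name>
_ALGO_ARN_PATTERN = re.compile(
    r"^arn:(?P<partition>aws[a-z-]*):sagemaker:(?P<region>[a-z0-9-]+)"
    r":(?P<account>\d{12}):algorithm/(?P<name>.+)$"
)


def parse_algorithm_arn(arn: str) -> dict:
    """Hypothetical helper: split an algorithm ARN into its parts, or raise ValueError."""
    match = _ALGO_ARN_PATTERN.match(arn.strip())
    if match is None:
        raise ValueError(f"Not a SageMaker algorithm ARN: {arn!r}")
    return match.groupdict()


# Placeholder ARN for illustration only; the account ID and name are made up.
example = parse_algorithm_arn(
    "arn:aws:sagemaker:us-east-1:123456789012:algorithm/example-algorithm"
)
print(example["region"])
```

In the notebook you would call `parse_algorithm_arn(algo_arn)` and compare its `"region"` entry with the region of your session.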

## 2. Set up environment

In [2]:
import sagemaker as sage
from sagemaker import get_execution_role
import boto3
import os
import pandas as pd
import tarfile

The two code cells below might need a different setup if you are running this sample notebook locally. Please check out the [guide to run sample notebooks locally](../GUIDE.md/#run-sample-notebooks-locally) for more information.

In [3]:
try:
    session = boto3.Session()
    sagemaker_session = sage.Session(session)

except ValueError:
    # AWS access key ID and secret access key only need to be specified if running the notebook locally
    # (and AWS credentials were not previously set up)
    aws_access_key_id = "<Specify your AWS Access ID>"
    aws_secret_access_key = "<Specify your AWS Secret Access Key>"
    region = "us-east-1"  # replace with other region if you want, ensure that it matches the region in the ARN
    session = boto3.Session(aws_access_key_id, aws_secret_access_key, region_name=region)
    sagemaker_session = sage.Session(session)

In [4]:
# this variable only needs to be specified if running the notebook locally rather than in SageMaker
local_variable_for_sm_role = "arn:aws:iam::XXXXXX:role/service-role/SageMaker-XXXXX"  

try:
    role = get_execution_role()
except ValueError:
    role = local_variable_for_sm_role

In [5]:
# define S3 locations for saving data, replace if you would like to store your data in alternative locations
bucket_name = sagemaker_session.default_bucket()  # bucket where data will be stored
base_folder_name = "label-inspector"  # folder inside your bucket where data will be stored

training_instance_type = "ml.m5.xlarge"  # what type of EC2 instance to use (i.e. how powerful of a computer)

The choice of EC2 instance will affect how much data can be handled (due to memory limits), how long it takes to return results (ML training takes time), and possibly how accurate the results are. More powerful instances will improve things along all these dimensions.

If your dataset contains text fields (strings that are not discrete categories), we recommend a p*-instance with a GPU so that large language models can be fine-tuned on your data. Using a GPU produces more accurate results for datasets with text.

If your dataset is big (over 100k rows), we recommend an instance with lots of memory: "ml.m5.24xlarge" if there are no text fields, "ml.p3.16xlarge" otherwise.
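
The guidance above can be written down as a small lookup. This is a rough sketch rather than an official sizing rule: `suggest_training_instance` is a hypothetical helper, and the choice of `ml.p3.2xlarge` for smaller datasets with text is an assumption (the text only asks for a GPU p-instance).

```python
def suggest_training_instance(num_rows: int, has_text_fields: bool) -> str:
    """Hypothetical helper that encodes the instance guidance from this section."""
    is_big = num_rows > 100_000  # "big" means over 100k rows
    if has_text_fields:
        # Assumption: ml.p3.2xlarge for smaller text datasets; only the
        # ml.p3.16xlarge recommendation for big datasets comes from the text.
        return "ml.p3.16xlarge" if is_big else "ml.p3.2xlarge"
    # ml.m5.xlarge is the default chosen earlier in this notebook.
    return "ml.m5.24xlarge" if is_big else "ml.m5.xlarge"


print(suggest_training_instance(250_000, has_text_fields=False))
```

Treat the result as a starting point and adjust it to your budget and to the memory your data actually needs.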

## 3. Prepare dataset and Upload to Amazon S3 (skip if data is already in S3)

### View Sample Dataset 

Here is an example dataset that you can run Label Inspector on. Label Inspector will take approximately 5 minutes to train an ML model and identify label errors on this example dataset.

If using your own data, please ensure that the first line of your CSV file is a header with the column names for your data. The first column of your data must contain the class labels (the remaining columns will be treated as predictive features). Also make sure that the labels are categorical (strings or discrete integers are okay, continuous numbers are not), as only multi-class (and binary) classification datasets are supported.
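
If you bring your own CSV, a quick pre-flight check of these requirements can save a failed training job. The sketch below is one possible check, not what Label Inspector itself runs: `check_label_column` is a hypothetical helper that looks for a header row, empty labels, and labels that look like continuous numbers.

```python
import csv
import io


def check_label_column(csv_text: str) -> list:
    """Hypothetical pre-flight check: return a list of problems found in the CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if len(rows) < 2:
        return ["file needs a header row plus at least one data row"]
    header, data = rows[0], rows[1:]
    problems = []
    if any(name.strip() == "" for name in header):
        problems.append("header contains an empty column name")
    labels = [row[0] for row in data if row]  # the first column holds the labels
    if any(label.strip() == "" for label in labels):
        problems.append("some rows have an empty label")
    for label in labels:
        try:
            value = float(label)
        except ValueError:
            continue  # non-numeric strings are fine as class labels
        if not value.is_integer():
            problems.append(f"label {label!r} looks like a continuous number")
            break
    return problems


sample = (
    "letter_grade,exam_1,exam_2,exam_3,notes\n"
    "C,53,77,93,\n"
    "B,81,64,80,great participation +10\n"
)
print(check_label_column(sample))  # an empty list means no problems were detected
```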

In [6]:
dataset = pd.read_csv("data/train/input/train_dataset.csv")
dataset.head(5)

Unnamed: 0,letter_grade,exam_1,exam_2,exam_3,notes
0,C,53,77,93,
1,B,81,64,80,great participation +10
2,B,74,88,97,
3,C,61,94,78,
4,C,48,90,91,


### Upload datasets to Amazon S3

In [7]:
training_dataset = "data/train/input/train_dataset.csv"  # replace filepath here if using your own dataset

In [8]:
# upload data to S3
input_folder = "{}/{}/{}".format(base_folder_name, "train", "input")
training_data_location = sagemaker_session.upload_data(training_dataset, bucket=bucket_name, key_prefix=input_folder)

In [None]:
# you can find your data here after uploading
training_data_location

## 4. Train an ML model to analyze the labels in our dataset

After ensuring that our data is in an accessible Amazon S3 bucket (only read permissions are required for the algorithm to access the input data bucket), we are ready to train a machine learning model. This model can be automatically trained on diverse types of tabular/text data via state-of-the-art AutoML, and is used to identify which labels are likely incorrect.

In the code cell below we specify the S3 location of our data. If you have followed the [Upload datasets to Amazon S3](#Upload-datasets-to-Amazon-S3) section of this tutorial, the S3 location should already be specified. If you are using your own dataset, make sure to specify the location of your data below.

In [10]:
training_data = training_data_location  # replace with the S3 URI of your data if using your own dataset

In [11]:
# formatting the output folder location based on variables specified above
# this is boilerplate code and does not need to be edited
output_location = "s3://{}/{}/{}/{}".format(bucket_name, base_folder_name, "train", "output")  

Next, we specify the hyperparameters for our algorithm. Currently, our algorithm only supports one hyperparameter `runtime` with two options:

- `fast` will have a shorter execution time, but may not produce the best quality results (maximum execution time: 2 hours, will take much less time for most datasets)
- `high_accuracy` will be slower, but produces high quality results (maximum execution time: 13 hours, will take much less time for most datasets)

In either case, you can get results faster by specifying a more powerful `instance_type` (and the results may be more accurate). When estimating costs, keep in mind that a more powerful instance can run the job faster (and has more memory available which may be required for larger datasets). ML training scales proportionally to the size of your dataset and may take some time, so just keep this job running and check back later to see if it has completed.
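
Since `runtime` accepts only these two values, a tiny guard can catch typos before a training job is launched. This is a minimal sketch: `make_hyperparameters` and `MAX_RUNTIME_HOURS` are hypothetical names, and the hour limits are the ones quoted above.

```python
# Maximum execution time in hours for each `runtime` option, as stated above.
MAX_RUNTIME_HOURS = {"fast": 2, "high_accuracy": 13}


def make_hyperparameters(runtime: str) -> dict:
    """Hypothetical helper: build the hyperparameters dict, rejecting unknown options."""
    if runtime not in MAX_RUNTIME_HOURS:
        raise ValueError(
            f"runtime must be one of {sorted(MAX_RUNTIME_HOURS)}, got {runtime!r}"
        )
    return {"runtime": runtime}


trial_hyperparameters = make_hyperparameters("fast")
print(trial_hyperparameters, "upper bound:", MAX_RUNTIME_HOURS["fast"], "hours")
```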

In [12]:
hyperparameters = {"runtime": "high_accuracy"}  # change to "fast" to get quicker results (will be less accurate)

Then, we create an estimator object for running a training job and train our model. For information on creating an `Estimator` object, check out the [documentation](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html).

In [13]:
estimator = sage.algorithm.AlgorithmEstimator(
    algorithm_arn=algo_arn,
    base_job_name="label-inspector",
    role=role,
    instance_count=1,
    instance_type=training_instance_type,
    input_mode="File",
    output_path=output_location,
    sagemaker_session=sagemaker_session,
    hyperparameters=hyperparameters,
)

In [14]:
# run the training job
estimator.fit({"training": training_data})

INFO:sagemaker:Creating training-job with name: label-inspector-2023-03-24-02-28-53-009


2023-03-24 02:28:53 Starting - Starting the training job...
2023-03-24 02:29:07 Starting - Preparing the instances for training...
2023-03-24 02:29:49 Downloading - Downloading input data...
2023-03-24 02:30:14 Training - Downloading the training image.........
2023-03-24 02:31:50 Training - Training image download completed. Training in progress...
Analyzing your data ... the estimated maximum runtime (upper bound) is: 13 hours
Label errors found.

2023-03-24 02:33:46 Uploading - Uploading generated training model
2023-03-24 02:33:46 Completed - Training job completed
Training seconds: 238
Billable seconds: 238


In [15]:
# your output will be available at the following path
# (built with an f-string because os.path.join would insert backslashes into the S3 URI on Windows)
output_file_location = f"{output_location}/{estimator._current_job_name}/output"
output_file_location

's3://sagemaker-us-east-2-043170249292/label-inspector/train/output/label-inspector-2023-03-24-02-28-53-009/output'
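
The S3 client calls in the next section take the bucket and the object key separately rather than a full `s3://` URI. As a small sketch, a URI like the one above can be split with the standard library; `split_s3_uri` is a hypothetical helper, not a SageMaker or boto3 function.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple:
    """Hypothetical helper: split 's3://bucket/some/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# The URI shown in the output above.
bucket, key_prefix = split_s3_uri(
    "s3://sagemaker-us-east-2-043170249292/label-inspector/train/output/"
    "label-inspector-2023-03-24-02-28-53-009/output"
)
print(bucket)
print(key_prefix + "/output.tar.gz")
```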

## 5. Fetch results of the label analysis

After the training job completes, we can get the results output by this analysis from the S3 bucket specified above. The main output of this solution is a CSV file with information about each label in your dataset. To learn more about what each column of the output file contains, check out our documentation [here](README.md#output). The examples flagged as likely mislabeled with the lowest label quality scores are the ones whose labels you should review closely.

In [16]:
s3 = session.client('s3')

# downloading the file from S3
with open('data/train/output/output.tar.gz', 'wb') as f:
    s3.download_fileobj(bucket_name, f"{base_folder_name}/train/output/{estimator._current_job_name}/output/output.tar.gz", f)

In [17]:
with tarfile.open('data/train/output/output.tar.gz') as file:
    file.extractall('data/train/output/')
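
If you would like to see what an archive contains before (or instead of) extracting it, the member names can be listed first. The sketch below is self-contained and therefore builds a throwaway archive; `list_archive_members` is a hypothetical helper, and with the real download you would pass `'data/train/output/output.tar.gz'`.

```python
import io
import os
import tarfile
import tempfile


def list_archive_members(archive_path: str) -> list:
    """Hypothetical helper: names of the regular files inside a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]


# Build a small demo archive so that the example runs without AWS access.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "output.tar.gz")
    payload = b"is_label_issue,label_score\nFalse,0.67\n"
    with tarfile.open(demo_path, "w:gz") as archive:
        info = tarfile.TarInfo(name="cleanset.csv")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    members = list_archive_members(demo_path)

print(members)
```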

In [18]:
cleanset = pd.read_csv("data/train/output/cleanset.csv")
cleanset

Unnamed: 0,is_label_issue,label_score,given_label,predicted_label
0,False,0.672778,C,C
1,False,0.691504,B,B
2,False,0.729446,B,B
3,False,0.662407,C,C
4,False,0.458495,C,C
...,...,...,...,...
936,False,0.377611,F,F
937,False,0.417643,F,F
938,False,0.733176,F,F
939,False,0.755515,B,B


Each row in the returned cleanset corresponds to a row in your original dataset. You can concatenate these two files into one to better review the identified label errors. We show an example of how to view the top errors in our sample dataset below:

In [19]:
original_dataset = pd.read_csv("data/train/input/train_dataset.csv")  # replace filepath here if using your own dataset
merged_cleanset = pd.concat([original_dataset, cleanset], axis=1)

Then, we can view the top label errors that were identified by filtering for the examples where `is_label_issue` is `True`, and sorting them by their `label_score` (a lower score indicates a more likely erroneous label).

In [20]:
label_errors = merged_cleanset[merged_cleanset["is_label_issue"] == True].sort_values("label_score")
label_errors.head()

Unnamed: 0,letter_grade,exam_1,exam_2,exam_3,notes,is_label_issue,label_score,given_label,predicted_label
318,F,98,51,74,,True,0.012267,F,C
705,A,97,0,90,"cheated on exam, gets 0pts",True,0.01468,A,D
689,B,77,51,70,,True,0.015814,B,D
597,F,81,100,74,,True,0.018192,F,B
619,F,89,77,68,,True,0.0226,F,C
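
Beyond inspecting individual rows, it can be informative to see which (given, predicted) label pairs occur most often among the flagged examples, since a frequent pair may point to a systematic labeling mistake. The sketch below is one way to do this with pandas; it uses the five rows shown above so that it runs on its own, and the names `label_errors_demo` and `confusion_pairs` are illustrative.

```python
import pandas as pd

# A few rows copied from the output above, only to keep the sketch self-contained.
label_errors_demo = pd.DataFrame(
    {
        "given_label": ["F", "A", "B", "F", "F"],
        "predicted_label": ["C", "D", "D", "B", "C"],
        "label_score": [0.012267, 0.01468, 0.015814, 0.018192, 0.0226],
    }
)

# Count how often each (given, predicted) pair occurs among the flagged rows.
confusion_pairs = (
    label_errors_demo.groupby(["given_label", "predicted_label"])
    .size()
    .sort_values(ascending=False)
)
print(confusion_pairs)
```

In the notebook, the same `groupby` can be applied directly to `label_errors`.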


## 6. Deploy trained model and perform real time inference

Label Inspector trains an ML model to identify potential label errors. After the training is completed, you can deploy the trained model as an endpoint and perform real time inference on any new data that you have.

If you want to understand how real-time inference with Amazon SageMaker works, check out this [AWS documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html).

In [21]:
# define instance type and endpoint name
real_time_inference_instance_type = "ml.m5.xlarge"  # replace with a larger instance if necessary
endpoint_name = "label-inspector-predictor"  # note that endpoint names have to be unique

content_type = "text/csv"  # Label Inspector only supports CSV files, do not change this

In [22]:
predictor = estimator.deploy(
    initial_instance_count=1, 
    instance_type=real_time_inference_instance_type, 
    endpoint_name=endpoint_name,
)

INFO:sagemaker:Creating model package with name: label-inspector-2023-03-24-02-34-09-683


................

INFO:sagemaker:Creating model with name: label-inspector-2023-03-24-02-34-09-683





INFO:sagemaker:Creating endpoint-config with name label-inspector-predictor
INFO:sagemaker:Creating endpoint with name label-inspector-predictor


--------!

Now that your endpoint is deployed, we can run real time inference on a dataset to get the predicted labels.

Here we will use an example dataset that we want to perform inference on. If using your own data, please ensure that the data passed to the inference job has the same columns as the training data (the only column that is not needed is the label column).
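
One way to verify this requirement before invoking the endpoint is to compare the two CSV headers. The sketch below assumes, as required earlier, that the first training column is the label; `read_header` and `missing_feature_columns` are hypothetical helpers, and the header strings mirror the sample datasets in this notebook.

```python
import csv
import io


def read_header(csv_text: str) -> list:
    """Return the header row of CSV text as a list of column names."""
    return next(csv.reader(io.StringIO(csv_text)))


def missing_feature_columns(train_header: list, inference_header: list) -> list:
    """Hypothetical check: training feature columns that the inference file lacks.

    Assumes the first training column is the label and skips it.
    """
    feature_columns = train_header[1:]
    return [name for name in feature_columns if name not in inference_header]


train_header = read_header("letter_grade,exam_1,exam_2,exam_3,notes\nC,53,77,93,\n")
inference_header = read_header(
    "exam_1,exam_2,exam_3,notes\n82,69,75,great final presentation +10\n"
)
# An empty list means every training feature is present in the inference data.
print(missing_feature_columns(train_header, inference_header))
```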

In [23]:
# specify the input and output filepaths
real_time_input_file = "data/real_time_inference/input/inference_dataset.csv"  # replace filepath if using your own dataset
real_time_output_file = "data/real_time_inference/output/predictions.csv"  # replace filepath if you want to save the outputs somewhere else

In [24]:
# here is the sample dataset for inference; notice that the columns match the training data, minus the label column
inference_dataset = pd.read_csv(real_time_input_file)
inference_dataset.head()

Unnamed: 0,exam_1,exam_2,exam_3,notes
0,82,69,75,great final presentation +10
1,94,90,90,great participation +10
2,92,63,91,missed homework frequently -10
3,91,0,68,"cheated on exam, gets 0pts"
4,81,56,58,


Then, we can invoke the deployed endpoint and save the results to a file.

In [25]:
runtime = session.client('sagemaker-runtime')
endpoint_name = predictor.endpoint_name

# read file into memory
with open(real_time_input_file, encoding="utf-8") as f:
    data_input = f.read()
    
# invoke endpoint to perform inference
response = runtime.invoke_endpoint(EndpointName=endpoint_name, ContentType='text/csv', Body=data_input)

# unpack response and save to file
with open(real_time_output_file, 'wb') as file:
    file.write(response['Body'].read())

You can also invoke the endpoint with the following CLI command:

```shell
!aws sagemaker-runtime invoke-endpoint --endpoint-name $endpoint_name --body fileb://$real_time_input_file --content-type $content_type --region $sagemaker_session.boto_region_name $real_time_output_file
```

Let us view the predictions from our real time inference:

In [26]:
real_time_preds = pd.read_csv(real_time_output_file)
real_time_preds.head()

Unnamed: 0,letter_grade
0,B
1,A
2,C
3,F
4,D


## 7. Perform batch inference

We can also perform batch inference using our trained model, which allows multiple input payloads at a time (usually stored in S3).

If you want to learn more about batch transform, check out this [AWS documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html).

In [27]:
batch_transform_inference_instance_type = "ml.m5.xlarge"  # replace with a larger instance if necessary
content_type = "text/csv"  # Label Inspector only supports CSV files, do not change this

Since batch transform jobs can take in multiple payloads, we upload data from a folder (instead of an individual file) to an S3 folder. We will then pass in that S3 folder to the batch transform job which can perform inference on all files in that folder.

In [33]:
# specify the location of the local folder that we want to upload to S3 and download results to
# replace this local folder if you have your data stored somewhere else
transform_input_folder = "data/batch_inference/input/"  
transform_output_folder = "data/batch_inference/output/"  

In [34]:
# specify the input and output folders paths on S3
transform_input_path = "{}/{}/{}".format(base_folder_name, "inference", "input")
transform_output_path = "s3://{}/{}/{}/{}".format(bucket_name, base_folder_name, "inference", "output")  

Next, we upload the data in our specified local folder to an S3 folder. Our local folder (specified using `transform_input_folder`) contains one CSV file that we want to perform batch transform on. Note that you can have as many files as you want in that folder as long as they are valid files to perform batch transform on.

Skip this step if your data is already in S3, and just specify where your data is using the `transform_input_location` variable.

In [None]:
# upload the batch inference input files to S3
# (if your data is already in S3, skip the upload and set transform_input_location to its S3 URI)
transform_input_location = sagemaker_session.upload_data(transform_input_folder, bucket=bucket_name, key_prefix=transform_input_path)

print("Transform input uploaded to " + transform_input_location)

After ensuring that your data is on S3, you can run the batch transform by passing in the S3 location where your data is stored. The batch transform job will save its results to an output folder in S3.

In [None]:
transformer = estimator.transformer(1, batch_transform_inference_instance_type, output_path=transform_output_path)
transformer.transform(transform_input_location, content_type=content_type)
transformer.wait()

In [37]:
# output is available at the following path
transformer.output_path

's3://sagemaker-us-east-2-043170249292/label-inspector/inference/output'

After the batch transform job is complete, we can get the output data from the S3 bucket specified above. Here, we will download the entire S3 folder, so if you had more than one input file, the following code cell will download all the corresponding output files.

Then, we will show a sample of what the output data will look like.

In [38]:
resource = session.resource('s3')
resource_bucket = resource.Bucket(bucket_name)

for obj in resource_bucket.objects.filter(Prefix=f'{base_folder_name}/inference/output/'):
    filename = obj.key.split("/")[-1]
    if not filename:
        continue  # skip "folder" placeholder keys that end with a slash
    resource_bucket.download_file(obj.key, os.path.join(transform_output_folder, filename))

In [39]:
batch_transform_preds = pd.read_csv(os.path.join(transform_output_folder, "inference_dataset.csv.out"))
batch_transform_preds.head()

Unnamed: 0,letter_grade
0,B
1,A
2,C
3,F
4,D
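
Batch transform names each result after its input file with `.out` appended, which is why the cell above reads `inference_dataset.csv.out`. When a folder holds several input files, a quick check for inputs that have no downloaded result can be sketched as follows. `inputs_without_output` is a hypothetical helper, and the demo folders and the file `second_batch.csv` are made up for illustration.

```python
import tempfile
from pathlib import Path


def inputs_without_output(input_folder: str, output_folder: str) -> list:
    """Hypothetical helper: input CSV names whose `<name>.out` file is not in output_folder."""
    missing = []
    for input_path in sorted(Path(input_folder).glob("*.csv")):
        if not (Path(output_folder) / (input_path.name + ".out")).exists():
            missing.append(input_path.name)
    return missing


# Throwaway folders for the demonstration; in this notebook you would pass
# transform_input_folder and transform_output_folder instead.
with tempfile.TemporaryDirectory() as in_dir, tempfile.TemporaryDirectory() as out_dir:
    (Path(in_dir) / "inference_dataset.csv").write_text("exam_1\n82\n")
    (Path(in_dir) / "second_batch.csv").write_text("exam_1\n94\n")
    (Path(out_dir) / "inference_dataset.csv.out").write_text("letter_grade\nB\n")
    missing_outputs = inputs_without_output(in_dir, out_dir)

print(missing_outputs)
```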


## 8. Clean Up Resources 

### Delete model and endpoint

If you are done with performing real-time inferences, you no longer need the saved model and deployed endpoint. Here we show how to terminate the endpoint to avoid being charged.

In [40]:
# predictor.delete_model()
# predictor.delete_endpoint(delete_endpoint_config=True)

You can also delete the model associated with the batch transform job.

In [41]:
# transformer.delete_model()

### Delete S3 resources (optional)

The code cell below contains the code that can be used to automatically delete the files created in S3 in this sample notebook. Be wary that this command will delete all the files stored in the `base_folder_name` folder specified above. Proceed with caution if you have previously stored other files in that folder.

Alternatively, you can navigate to the [S3 Management Console](https://s3.console.aws.amazon.com/) and manually delete these data files yourself.

In [42]:
# resource = session.resource('s3')
# resource_bucket = resource.Bucket(bucket_name)

# before executing the code below, ensure there is nothing else important in this folder
# resource_bucket.objects.filter(Prefix=f"{base_folder_name}/").delete()  
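
Because the delete is prefix-based, a dry run that only lists the affected keys is a cheap safeguard. The sketch below assumes `bucket` behaves like the boto3 `Bucket` resource used in the cell above; `keys_under_prefix` is a hypothetical helper and deletes nothing.

```python
def keys_under_prefix(bucket, prefix: str) -> list:
    """Hypothetical dry run: list the object keys that a prefix-based delete would remove.

    `bucket` is expected to behave like a boto3 S3 `Bucket` resource, i.e. to
    provide `bucket.objects.filter(Prefix=...)` yielding objects with a `key`.
    """
    return [obj.key for obj in bucket.objects.filter(Prefix=prefix)]
```

For example, `keys_under_prefix(resource_bucket, f"{base_folder_name}/")` returns the keys to review before uncommenting the delete call.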

## Other Tips

To get results faster (with potentially less accurate estimates): Specify a more powerful instance type and set `hyperparameters = {"runtime": "fast"}`. You can also try subsampling your dataset to a smaller one for a quick trial run (although be aware that the ML training, and hence the accuracy of the estimated label issues, will become worse when the dataset is small).

To get more accurate results: Specify a more powerful instance type. If you have text fields in your dataset (i.e. strings that are not discrete categories), use an instance that has a GPU (like a p-instance or g-instance) so large language models can be fine-tuned on your data. Also set `hyperparameters = {"runtime": "high_accuracy"}`.

To ask questions or report problems, please email: support@cleanlab.ai

To run a more in-depth analysis of your data and labels, try [Cleanlab Studio](https://cleanlab.ai/studio/) for free (it supports image data as well).