# Identify Label Errors with Label Inspector from AWS Marketplace 


cleanlab's [Label Inspector (Tabular)](TODO: marketplace listing url) algorithm automatically detects label errors in your tabular classification dataset. All you need is your dataset containing class labels and the feature values of each datapoint, and we will flag examples that potentially have erroneous labels.

This sample notebook will show you how to use the Label Inspector in Amazon SageMaker.

> **Note**: This is a reference notebook, and it cannot run unless you make the changes suggested in the notebook.

## Pre-requisites
1. **Note**: This notebook contains elements that render correctly in the Jupyter interface. Open this notebook from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
1. Ensure that the IAM role used has the **AmazonSageMakerFullAccess** policy attached.
1. Some hands-on experience using [Amazon SageMaker](https://aws.amazon.com/sagemaker/).
1. To use this algorithm successfully, ensure that:
    1. Either your IAM role has these three permissions and you have authority to make AWS Marketplace subscriptions in the AWS account used: 
        1. **aws-marketplace:ViewSubscriptions**
        1. **aws-marketplace:Unsubscribe**
        1. **aws-marketplace:Subscribe**  
    2. or your AWS account has a subscription to [Label Inspector](TODO: marketplace listing url). 
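
The three AWS Marketplace permissions listed above could be granted with an IAM policy along the following lines. This is a minimal sketch rather than an official policy: the variable name and the `"Resource": "*"` scope are assumptions, and your organization may require tighter scoping.

```python
import json

# Hypothetical policy document granting only the three Marketplace
# actions named in the pre-requisites above.
marketplace_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "aws-marketplace:ViewSubscriptions",
                "aws-marketplace:Unsubscribe",
                "aws-marketplace:Subscribe",
            ],
            "Resource": "*",  # assumption: narrow this if your account requires it
        }
    ],
}

# Print the JSON that could be pasted into the IAM policy editor.
print(json.dumps(marketplace_policy, indent=2))
```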


## Usage instructions
You can run this notebook one cell at a time (press Shift+Enter to run a cell).

## 1. Subscribe to the algorithm

To subscribe to the algorithm:
1. Open the algorithm listing page [Label Inspector](TODO: marketplace listing url).
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the terms and click **Accept Offer** if you agree with the EULA, pricing, and support terms.
1. Once you click the **Continue to configuration** button and choose a **region**, you will see a **Product Arn**. This is the algorithm ARN that you need to specify to use this algorithm. Copy the ARN corresponding to your region and specify it in the following cell.

In [None]:
algo_arn = "<Specify the ARN for Label Inspector obtained from AWS Marketplace>"
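
Because the ARN is region-specific, it can help to sanity-check the value you pasted before starting a job. The sketch below assumes the usual six-part ARN layout (`arn:partition:service:region:account:resource`) and that algorithm resources start with `algorithm/`; the helper names and the example ARN are hypothetical.

```python
def parse_arn(arn: str) -> dict:
    """Split an ARN into its six colon-separated components."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    keys = ("prefix", "partition", "service", "region", "account", "resource")
    return dict(zip(keys, parts))


def check_algorithm_arn(arn: str, expected_region: str) -> None:
    """Raise ValueError unless the ARN is a SageMaker algorithm in expected_region."""
    fields = parse_arn(arn)
    if fields["service"] != "sagemaker" or not fields["resource"].startswith("algorithm/"):
        raise ValueError("ARN does not point to a SageMaker algorithm")
    if fields["region"] != expected_region:
        raise ValueError(
            f"ARN is for region {fields['region']}, expected {expected_region}"
        )


# Hypothetical ARN for illustration only; use the one shown on your listing page.
example_arn = "arn:aws:sagemaker:us-east-1:123456789012:algorithm/example-algorithm"
check_algorithm_arn(example_arn, "us-east-1")
print(parse_arn(example_arn)["resource"])
```

In this notebook, the expected region could be taken from `sagemaker_session.boto_region_name` once the session has been created in the next section.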

## 2. Set up environment

In [None]:
import sagemaker as sage
from sagemaker import get_execution_role
import boto3
import os
import pandas as pd
import tarfile

In [None]:
sagemaker_session = sage.Session()
role = get_execution_role()

In [None]:
# define S3 locations for saving data, replace if you would like to store your data in alternative locations
bucket = sagemaker_session.default_bucket() 
base_folder_name = "label-inspector-tabular"  
input_folder = "{}/{}".format(base_folder_name, "input")
output_location = "s3://{}/{}/{}".format(bucket, base_folder_name, "output")

instance_type = "ml.m5.xlarge"  # replace with ml.m5.24xlarge if more powerful instance is required

## 3. Prepare dataset and Upload to Amazon S3 (skip if data is already in S3)

### View Sample Dataset 

Here is a sample input that is accepted by the Label Inspector package.

If using your own data, please ensure that the first line of your CSV file is a header with the column names for your data. The first column of your data must contain the class labels (the remaining columns will be treated as predictive features). Also make sure that the labels are categorical: strings or discrete integers are fine, but continuous numbers are not, as only multi-class (and binary) classification datasets are supported.

In [None]:
sample_input = pd.read_csv("data/input/dataset.csv")
sample_input.head(5)
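
If you bring your own CSV, a quick local check against the format rules above can catch problems before you pay for a training job. The following is a minimal sketch using only the standard library; `check_label_csv` and the example column names are hypothetical, and the checks are an approximation of the stated requirements, not the algorithm's own validation.

```python
import csv
import io


def _looks_continuous(label: str) -> bool:
    """True for numeric labels that are not whole numbers, e.g. '0.37'."""
    try:
        value = float(label)
    except ValueError:
        return False  # non-numeric strings are treated as class names
    return not value.is_integer()


def check_label_csv(text: str) -> list:
    """Return a list of human-readable problems found in the CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    if len(rows) < 2:
        return ["file needs a header line and at least one data row"]
    header, data = rows[0], rows[1:]
    problems = []
    if any(len(row) != len(header) for row in data):
        problems.append("some rows do not match the header width")
    # The first column is expected to hold the class label.
    labels = [row[0].strip() for row in data if row]
    if any(label == "" for label in labels):
        problems.append("some rows have an empty label in the first column")
    if len(set(labels)) < 2:
        problems.append("fewer than two distinct classes in the first column")
    if any(_looks_continuous(label) for label in labels):
        problems.append("labels look like continuous numbers, not classes")
    return problems


# Hypothetical data for illustration.
good = "label,age,income\nyes,31,52000\nno,45,61000\n"
bad = "target,x\n0.37,1\n0.92,2\n"
print(check_label_csv(good))
print(check_label_csv(bad))
```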

### Upload datasets to Amazon S3

In [None]:
training_dataset = "data/input/dataset.csv"  # replace filepath here if using your own dataset

In [None]:
# upload data to S3
training_data_location = sagemaker_session.upload_data(training_dataset, bucket=bucket, key_prefix=input_folder)

In [None]:
# you can find your data here after uploading
training_data_location
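
For reference, the returned location is expected to combine the bucket, the key prefix, and the local file name. The sketch below reproduces that composition without calling AWS, assuming a single-file upload lands at `<key_prefix>/<file name>`; `expected_s3_uri` and the bucket name are hypothetical.

```python
import posixpath


def expected_s3_uri(local_path: str, bucket: str, key_prefix: str) -> str:
    """Build the URI a single uploaded file is assumed to end up at."""
    filename = posixpath.basename(local_path)
    return f"s3://{bucket}/{key_prefix}/{filename}"


# "example-bucket" is a placeholder; the real default bucket name is account-specific.
print(
    expected_s3_uri(
        "data/input/dataset.csv",
        "example-bucket",
        "label-inspector-tabular/input",
    )
)
```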

## 4. Train a machine learning model

After ensuring that our data is in an accessible Amazon S3 bucket (only read permissions are required for the algorithm to access the input data bucket), we are ready to train a machine learning model. 

In the code cell below we specify the S3 location of our data. If you have followed the [Upload datasets to Amazon S3](#Upload-datasets-to-Amazon-S3) section of this tutorial, the S3 location should already be specified. If you are using your own dataset, make sure to specify the location of your data below.

In [None]:
training_data = training_data_location  # replace with the S3 URI of your data if using your own dataset

Next, we specify the hyperparameters for our algorithm. Currently, our algorithm supports only one hyperparameter, `runtime`, with two options:

- `fast` has a shorter execution time, but may not produce the best-quality results
- `high_accuracy` is slower, but produces higher-quality results

In [None]:
hyperparameters = {"runtime": "fast"}  # change to "high_accuracy" for better results (will generally take longer)
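
Since a misspelled `runtime` value would only surface once the training job has started, it may be worth validating it locally first. This is a small sketch built on the two options described above; the helper `make_hyperparameters` and the constant `ALLOWED_RUNTIMES` are hypothetical names, not part of the algorithm or the SageMaker SDK.

```python
# The two runtime options documented for the algorithm.
ALLOWED_RUNTIMES = ("fast", "high_accuracy")


def make_hyperparameters(runtime: str = "fast") -> dict:
    """Return the hyperparameter dict, rejecting unknown runtime values."""
    if runtime not in ALLOWED_RUNTIMES:
        raise ValueError(
            f"runtime must be one of {ALLOWED_RUNTIMES}, got {runtime!r}"
        )
    return {"runtime": runtime}


print(make_hyperparameters("high_accuracy"))
```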

Then, we create an estimator object for running a training job and train our model. For information on creating an `Estimator` object, check out the [documentation](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html).

In [None]:
estimator = sage.algorithm.AlgorithmEstimator(
    algorithm_arn=algo_arn,
    base_job_name="label-inspector-tabular",
    role=role,
    instance_count=1,
    instance_type=instance_type,
    input_mode="File",
    output_path=output_location,
    sagemaker_session=sagemaker_session,
    hyperparameters=hyperparameters,
)

In [None]:
# run the training job
estimator.fit({"training": training_data})

In [None]:
# your output will be available at the following path
# (S3 URIs always use "/", so build the path with an f-string rather than os.path.join)
output_file_location = f"{output_location}/{estimator._current_job_name}/output"
output_file_location
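
The output location is a full S3 URI, whereas the boto3 download call used in the next section takes a bucket and an object key separately. The sketch below shows one way to derive both from the URI, assuming the `<output path>/<job name>/output/output.tar.gz` layout that this notebook relies on; the helper names, bucket, and job name are hypothetical.

```python
from urllib.parse import urlparse


def output_archive_uri(output_location: str, job_name: str) -> str:
    """Build the URI of the archive written by a training job."""
    return f"{output_location.rstrip('/')}/{job_name}/output/output.tar.gz"


def split_s3_uri(uri: str) -> tuple:
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder bucket and job name for illustration.
archive_uri = output_archive_uri(
    "s3://example-bucket/label-inspector-tabular/output",
    "label-inspector-tabular-example-job",
)
archive_bucket, archive_key = split_s3_uri(archive_uri)
print(archive_bucket)
print(archive_key)
```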

## 5. Visualize output

After the training job is complete, we can get the output data from the S3 bucket specified above. Then we will show a sample of what the output data looks like. To learn more about what each column of the output file contains, check out our documentation [here](TODO: link to AWS page or readme).

In [None]:
s3 = boto3.client('s3')

# make sure the local output folder exists, then download the file from S3
os.makedirs('data/output', exist_ok=True)
with open('data/output/output.tar.gz', 'wb') as f:
    s3.download_fileobj(bucket, f"{base_folder_name}/output/{estimator._current_job_name}/output/output.tar.gz", f)

In [None]:
with tarfile.open('data/output/output.tar.gz') as file:
    file.extractall('data/output/')

In [None]:
pd.read_csv("data/output/cleanset.csv")
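
Before extracting an archive, you may want to list its members and confirm that each one would stay inside the destination folder. The following is a minimal sketch of such a check using only the standard library; `safe_member_names` is a hypothetical helper, and the demo archive is built in memory purely for illustration (only the `cleanset.csv` name comes from this notebook).

```python
import io
import os
import tarfile


def safe_member_names(archive: tarfile.TarFile, dest_dir: str) -> list:
    """Return names of regular-file members that would stay inside dest_dir."""
    dest_root = os.path.realpath(dest_dir)
    names = []
    for member in archive.getmembers():
        target = os.path.realpath(os.path.join(dest_root, member.name))
        inside = os.path.commonpath([dest_root, target]) == dest_root
        if member.isfile() and inside:
            names.append(member.name)
    return names


def _demo_archive() -> bytes:
    """Build a small gzipped tar in memory with one good and one bad member."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name in ("cleanset.csv", "../escape.txt"):
            data = b"x"
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


with tarfile.open(fileobj=io.BytesIO(_demo_archive()), mode="r:gz") as tar:
    demo_names = safe_member_names(tar, "data/output")
print(demo_names)
```

With a real download, the same function could be called on `tarfile.open('data/output/output.tar.gz')` and the result compared with `archive.getnames()` before extraction.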

## Note about Real-time and Batch Transform

Currently, real-time and batch inference have no effect, so do not try to use them. All results are returned at the end of the training job. 

In the future, we will be supporting real-time and batch inference, both to get predictions from a robust ML model (trained with label issues fixed) and to evaluate the label quality of future test datapoints.