# Identify Errors in Your Dataset with Data Inspector from AWS Marketplace 


Cleanlab's [Data Inspector](TODO: link) automatically detects errors in your tabular datasets. 

This sample notebook demonstrates how to use the Data Inspector via Amazon SageMaker. You can run it either locally on your computer or from within SageMaker (recommended).

View our handy [AWS Marketplace Guide](../GUIDE.md) if you get stuck anywhere, especially with providing credentials/ARNs or other setup steps.

## Pre-requisites
1. Ensure that the IAM role you use has the **AmazonSageMakerFullAccess** policy attached.
1. Some hands-on experience using [Amazon SageMaker](https://aws.amazon.com/sagemaker/).
1. To use this algorithm successfully, ensure that:
    1. Either your IAM role has these three permissions and you have authority to make AWS Marketplace subscriptions in the AWS account used: 
        1. **aws-marketplace:ViewSubscriptions**
        1. **aws-marketplace:Unsubscribe**
        1. **aws-marketplace:Subscribe**  
    2. or your AWS account has a subscription to [Data Inspector](TODO: link). 

## Usage instructions
You can run this notebook one cell at a time (press Shift+Enter to run a cell).

## 1. Subscribe to Data Inspector

To subscribe to the Data Inspector offering:
1. Open the AWS Marketplace listing page for [Data Inspector](TODO: link).
1. On the listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the EULA and pricing terms and click **"Accept Offer"** if you agree with them.
1. Click the **Continue to configuration** button and choose a **region**. You will then see a **Product Arn**. This is the algorithm ARN that you need to specify to use this algorithm. Copy the ARN corresponding to your region and specify it in the following cell.

> **Note**: This is a reference notebook and it cannot run unless you enter the algorithm ARN in the call below.

In [1]:
algo_arn = "<Specify the ARN for Data Inspector obtained from AWS Marketplace>"
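Because the ARN is region-specific, a quick sanity check before running the remaining cells can catch a leftover placeholder or a region mismatch. The sketch below assumes the usual colon-separated ARN layout (`arn:partition:service:region:account-id:resource`). The helper name `arn_region` and the example ARN are made up for illustration and are not part of the SageMaker SDK.

```python
def arn_region(arn: str) -> str:
    """Return the region field of an ARN, or raise if the value is not ARN-shaped."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"Not a valid ARN: {arn!r}")
    return parts[3]


# Hypothetical ARN for illustration only; use the one shown on your listing page.
example_arn = "arn:aws:sagemaker:us-east-1:123456789012:algorithm/example-algorithm"
print(arn_region(example_arn))  # us-east-1
```

Calling `arn_region(algo_arn)` with the unedited placeholder raises a `ValueError`, and the returned region can be compared with the region used for your session.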

## 2. Set up environment

In [2]:
import sagemaker as sage
from sagemaker import get_execution_role
import boto3
import os
import pandas as pd
import tarfile

The two code cells below might need a different setup if you are running this sample notebook locally. Please check out the [guide to run sample notebooks locally](../GUIDE.md/#run-sample-notebooks-locally) for more information.

In [3]:
try:
    session = boto3.Session()
    sagemaker_session = sage.Session(session)

except ValueError:
    # AWS access key ID and secret access key only need to be specified if running the notebook locally
    # (and AWS credentials were not previously set up)
    aws_access_key_id = "<Specify your AWS Access ID>"
    aws_secret_access_key = "<Specify your AWS Secret Access Key>"
    region = "us-east-1"  # replace with other region if you want, ensure that it matches the region in the ARN
    session = boto3.Session(aws_access_key_id, aws_secret_access_key, region_name=region)
    sagemaker_session = sage.Session(session)

In [4]:
# this local variable only needs to be specified if running the notebook locally rather than in SageMaker
local_variable_for_sm_role = "arn:aws:iam::XXXXXX:role/service-role/SageMaker-XXXXX"  

try:
    role = get_execution_role()
except ValueError:
    role = local_variable_for_sm_role

In [5]:
# define S3 locations for saving data, replace if you would like to store your data in alternative locations
bucket_name = sagemaker_session.default_bucket()  # bucket where data will be stored
base_folder_name = "data-inspector"  # folder inside your bucket where data will be stored

training_instance_type = "ml.m5.xlarge"  # what type of EC2 instance to use (i.e. how powerful of a computer)

The choice of EC2 instance will affect how much data can be handled (due to memory limits), how long it takes to return results (ML training takes time), and possibly how accurate the results are. More powerful instances will improve things along all these dimensions.

If your dataset contains text fields (strings that are not discrete categories), we recommend a GPU instance (a p*-instance) so that large language models can be fine-tuned on your data. Using a GPU will produce more accurate results for datasets with text.

If your dataset is big (over 100k rows), we recommend an instance with lots of memory: "ml.m5.24xlarge" if there are no text fields, "ml.p3.16xlarge" otherwise.
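The guidance above can be written down as a small lookup. This is only a sketch: `suggest_instance_type` is a hypothetical helper, and the choice of `"ml.p3.2xlarge"` for smaller datasets with text is an assumption, since the text only asks for a GPU instance in that case without naming a size.

```python
def suggest_instance_type(n_rows: int, has_text: bool) -> str:
    """Map the instance guidance above onto a SageMaker instance type string."""
    big = n_rows > 100_000  # "big" threshold taken from the text above
    if has_text:
        # Assumption: "ml.p3.2xlarge" is this sketch's pick for a smaller GPU instance.
        return "ml.p3.16xlarge" if big else "ml.p3.2xlarge"
    # "ml.m5.xlarge" is the default used in the cell above.
    return "ml.m5.24xlarge" if big else "ml.m5.xlarge"


print(suggest_instance_type(5_000, has_text=False))   # ml.m5.xlarge
print(suggest_instance_type(250_000, has_text=True))  # ml.p3.16xlarge
```

The result could be assigned to `training_instance_type`. Check instance availability and pricing in your region before relying on it.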

## 3. Prepare dataset and Upload to Amazon S3 (skip if data is already in S3)

### View Sample Dataset 

Here is an example dataset that you can run Data Inspector on. Data Inspector will take approximately 10 minutes to train various ML models and identify any potential errors in this sample dataset.

If using your own data, please ensure that the first line of your CSV file is a header with the column names for your data. Your data can optionally contain a unique index or ID column, which can be specified to Data Inspector later on (e.g. `stud_ID` is the unique ID column for this sample dataset).
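Before uploading your own file, it may help to confirm that the header contains the ID column and that its values are unique. The following is a minimal sketch using pandas. `check_dataset` is a hypothetical helper, not something Data Inspector provides, and the two rows are copied from the sample dataset.

```python
import io

import pandas as pd


def check_dataset(df: pd.DataFrame, index_col: str) -> list:
    """Return a list of human-readable problems found with the ID column."""
    problems = []
    if index_col not in df.columns:
        problems.append(f"ID column {index_col!r} not found in header")
    elif not df[index_col].is_unique:
        problems.append(f"ID column {index_col!r} contains duplicate values")
    return problems


# Two rows of the sample dataset, with the header as the first line.
sample_csv = io.StringIO(
    "stud_ID,name,exam_1,exam_2,exam_3,notes,letter_grade\n"
    "f48f73,Nicole Carter,53,77,93,,C\n"
    "0bd4e7,Tammy Myers,81,64,80,great participation +10,B\n"
)
sample = pd.read_csv(sample_csv)
print(check_dataset(sample, "stud_ID"))  # []
```

An empty list means both checks passed. For your own data, pass the DataFrame loaded from your CSV file and the name of your ID column.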

In [6]:
dataset = pd.read_csv("data/input/dataset.csv")
dataset.head(5)

,stud_ID,name,exam_1,exam_2,exam_3,notes,letter_grade
0,f48f73,Nicole Carter,53,77,93,,C
1,0bd4e7,Tammy Myers,81,64,80,great participation +10,B
2,e1795d,Lillian Lucas,74,88,97,,B
3,cb9d7a,Danielle Graham,61,94,78,,C
4,9acca4,Wilbur Fleming,48,90,91,,C


### Upload datasets to Amazon S3

In [7]:
training_dataset = "data/input/dataset.csv"  # replace filepath here if using your own dataset

In [8]:
# upload data to S3
input_folder = "{}/{}/{}".format(base_folder_name, "train", "input")
training_data_location = sagemaker_session.upload_data(training_dataset, bucket=bucket_name, key_prefix=input_folder)

In [None]:
# you can find your data here after uploading
training_data_location

## 4. Train an ML model to analyze the labels in our dataset

After ensuring that our data is in an accessible Amazon S3 bucket (only read permissions are required for the algorithm to access the input data bucket), we are ready to train a machine learning model. This model can be automatically trained on diverse types of tabular/text data via state-of-the-art AutoML, and is used to identify which labels are likely incorrect. 

In the code cell below we specify the S3 location of our data. If you have followed the [Upload datasets to Amazon S3](#Upload-datasets-to-Amazon-S3) section of this tutorial, the S3 location should already be specified. If you are using your own dataset, make sure to specify the location of your data below.

In [10]:
training_data = training_data_location  # replace with the S3 URI of your data if using your own dataset

In [11]:
# formatting the output folder location based on variables specified above
# this is boilerplate code and does not need to be edited
output_location = "s3://{}/{}/{}/{}".format(bucket_name, base_folder_name, "train", "output")  

Next, we specify the hyperparameters for our algorithm. Data Inspector supports 3 hyperparameters:

- The `runtime` argument specifies the training mode. There are two options:
    1. `fast` will have a shorter execution time, but may not produce the best quality results (maximum execution time: 3 hours, will take much less time for most datasets)
    2. `high_accuracy` will be slower, but produces high quality results (maximum execution time: 15 hours, will take much less time for most datasets)

    In either case, you can get results faster by specifying a more powerful `instance_type` (and the results may be more accurate). When estimating costs, keep in mind that a more powerful instance can run the job faster and has more memory available, which may be required for larger datasets. Training time grows with the size of your dataset, so keep the job running and check back later to see if it has completed.
    
- The `index_col` argument specifies the column to use as the index of the dataset. This index column name should be passed in as a string (e.g. `"stud_ID"`).

- The `columns_to_inspect` argument specifies the columns that should be checked for data issues. This should be a list of column names in your dataset (e.g. `["letter_grade", "exam_1"]`).

In [12]:
hyperparameters = {
    "runtime": "high_accuracy",  # change to "fast" to get quicker results (will be less accurate)
    "index_col": "stud_ID",  # name of the unique index (ID) column in your dataset
    # "columns_to_inspect": ["letter_grade", "exam_1"]  # can specify a subset of columns to inspect
}  
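A training job takes a few minutes just to start, so catching a misspelled option locally can save time. The sketch below checks a hyperparameter dictionary against the descriptions above. `validate_hyperparameters` is a hypothetical helper, and treating every key as optional is an assumption of this sketch rather than documented behavior of Data Inspector.

```python
ALLOWED_RUNTIMES = {"fast", "high_accuracy"}


def validate_hyperparameters(hp: dict, dataset_columns: list) -> None:
    """Raise ValueError if a hyperparameter disagrees with the description above."""
    runtime = hp.get("runtime")
    if runtime is not None and runtime not in ALLOWED_RUNTIMES:
        raise ValueError(f"runtime must be one of {sorted(ALLOWED_RUNTIMES)}, got {runtime!r}")

    index_col = hp.get("index_col")
    if index_col is not None and index_col not in dataset_columns:
        raise ValueError(f"index_col {index_col!r} is not a column of the dataset")

    unknown = [c for c in hp.get("columns_to_inspect", []) if c not in dataset_columns]
    if unknown:
        raise ValueError(f"columns_to_inspect contains unknown columns: {unknown}")


# Column names of the sample dataset used in this notebook.
sample_columns = ["stud_ID", "name", "exam_1", "exam_2", "exam_3", "notes", "letter_grade"]
validate_hyperparameters({"runtime": "high_accuracy", "index_col": "stud_ID"}, sample_columns)
```

The call returns silently when everything matches. With your own data, `list(dataset.columns)` can be passed as the second argument.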

Then, we create an estimator object for running a training job and train our model. For information on creating an `Estimator` object, check out the [documentation](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html).

In [13]:
estimator = sage.algorithm.AlgorithmEstimator(
    algorithm_arn=algo_arn,
    base_job_name="data-inspector",
    role=role,
    instance_count=1,
    instance_type=training_instance_type,
    input_mode="File",
    output_path=output_location,
    sagemaker_session=sagemaker_session,
    hyperparameters=hyperparameters,
)

In [14]:
# run the training job
estimator.fit({"training": training_data})

INFO:sagemaker:Creating training-job with name: data-inspector-2023-05-12-04-25-05-478


2023-05-12 04:25:05 Starting - Starting the training job...
2023-05-12 04:25:19 Starting - Preparing the instances for training...
2023-05-12 04:26:10 Downloading - Downloading input data...
2023-05-12 04:26:30 Training - Downloading the training image.....................
2023-05-12 04:30:06 Training - Training image download completed. Training in progress..
Analyzing your data ... the estimated maximum runtime (upper bound) is: 15 hours
Columns ['name'] were not inspected.

2023-05-12 04:36:49 Uploading - Uploading generated training model
2023-05-12 04:36:49 Completed - Training job completed
Training seconds: 638
Billable seconds: 638


In [None]:
# your output will be available at the following S3 path
# (built with an f-string so the S3 URI keeps forward slashes on every OS)
output_file_location = f"{output_location}/{estimator._current_job_name}/output"
output_file_location
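The path above is an S3 URI, whereas boto3 download calls take a bucket name and an object key separately. As a minimal sketch, a URI can be split with the standard library. `split_s3_uri` is a hypothetical helper and `example-bucket` is a made-up bucket name. The job name is the one from the log output above.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical bucket name; the archive name output.tar.gz matches the download step in section 5.
example_uri = (
    "s3://example-bucket/data-inspector/train/output/"
    "data-inspector-2023-05-12-04-25-05-478/output/output.tar.gz"
)
bucket, key = split_s3_uri(example_uri)
print(bucket)  # example-bucket
print(key)
```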

### Some Model Training Tips

To get results faster (but potentially less accurate estimates): Specify a more powerful instance type and set `hyperparameters = {"runtime": "fast"}`. You can also try subsampling your dataset to a smaller one for a quick trial run (although be aware the ML training -- and hence accuracy of estimated label issues -- will become worse when the dataset is small). 

To get more accurate results: Specify a more powerful instance type. If you have text fields in your dataset (i.e. strings that are not discrete categories), use an instance that has a GPU (like a p-instance or g-instance) so large language models can be fine-tuned on your data. Also set `hyperparameters = {"runtime": "high_accuracy"}`.

## 5. Fetch results of the data analysis

After the training job completes, we can get the results output by this analysis from the S3 bucket specified above. 

In [16]:
s3 = session.client('s3')

# create local folder to store output files (no error if it already exists)
os.makedirs('data/output', exist_ok=True)

# downloading the file from S3
with open('data/output/output.tar.gz', 'wb') as f:
    s3.download_fileobj(bucket_name, f"{base_folder_name}/train/output/{estimator._current_job_name}/output/output.tar.gz", f)
    
# extracting the downloaded tar.gz file to an output folder
with tarfile.open('data/output/output.tar.gz') as file:
    file.extractall('data/output/')

Data Inspector outputs 3 CSV files containing information about each datapoint in your dataset:

- `is_issue.csv` contains boolean True/False values specifying whether each datapoint is inferred to be an error
- `quality_scores.csv` contains quality scores between 0 and 1 for each datapoint (lower scores indicate noisier data that is more likely to be an error)
- `imputed_values.csv` contains a model-predicted value for each datapoint

All 3 CSV files contain the same number of rows as your original dataset, and the columns that you specified in `columns_to_inspect`. If `columns_to_inspect` was not specified, they contain all the columns in your original dataset. Each error boolean, quality score, and imputed value corresponds to the entry at the same row and column of your input dataset. 

The datapoints flagged as issues with the lowest quality scores are the ones you should review most closely. 

Note that columns that were skipped during the inspection will always have `is_issue = False` and `quality_score = 1` for all rows. For example, in this sample dataset the `name` column contains unique names, which are not meaningful to inspect, so that column has no informative results.


Let's load all the results alongside our input data. We can also view an example of some of the returned files.

In [17]:
original_dataset = pd.read_csv("data/input/dataset.csv", index_col="stud_ID") # replace filepath here if using your own dataset
is_issue = pd.read_csv("data/output/is_issue.csv", index_col="stud_ID")
quality_score = pd.read_csv("data/output/quality_scores.csv", index_col="stud_ID")
imputed_values = pd.read_csv("data/output/imputed_values.csv", index_col="stud_ID")

In [18]:
is_issue.head(3)

stud_ID,name,exam_1,exam_2,exam_3,notes,letter_grade
f48f73,False,False,False,False,False,False
0bd4e7,False,False,False,False,True,False
e1795d,False,False,False,False,False,False


In [19]:
quality_score.head(3)

stud_ID,name,exam_1,exam_2,exam_3,notes,letter_grade
f48f73,1,0.936789,0.924476,0.85422,1.0,0.811594
0bd4e7,1,0.900119,0.860793,0.876085,0.139237,0.771159
e1795d,1,0.897646,0.893876,0.704493,1.0,0.490956
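To get an overview before looking at individual columns, the two tables can be combined into a per-column issue count and a single list of flagged entries ranked by quality score. The following is a minimal sketch on three columns of the rows shown above. The `toy_` DataFrames are rebuilt by hand here so the example runs on its own and does not overwrite the notebook's variables.

```python
import pandas as pd

ids = pd.Index(["f48f73", "0bd4e7", "e1795d"], name="stud_ID")
toy_is_issue = pd.DataFrame(
    {
        "exam_1": [False, False, False],
        "notes": [False, True, False],
        "letter_grade": [False, False, False],
    },
    index=ids,
)
toy_quality_score = pd.DataFrame(
    {
        "exam_1": [0.936789, 0.900119, 0.897646],
        "notes": [1.0, 0.139237, 1.0],
        "letter_grade": [0.811594, 0.771159, 0.490956],
    },
    index=ids,
)

# Number of flagged entries in each column.
issue_counts = toy_is_issue.sum()

# One row per flagged entry, with the lowest quality scores first.
rows = [
    (idx, col, toy_quality_score.at[idx, col])
    for col in toy_is_issue.columns
    for idx in toy_is_issue.index[toy_is_issue[col].to_numpy()]
]
flagged = pd.DataFrame(rows, columns=["stud_ID", "column", "quality_score"])
flagged = flagged.sort_values("quality_score").reset_index(drop=True)

print(issue_counts)
print(flagged)
```

On the full results, the same two steps can be applied to `is_issue` and `quality_score` directly.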


Next, we define a function that will help us easily inspect the results for each column. The `inspect_column` function takes the name of the column that you want to check for errors and returns a DataFrame containing the original dataset alongside three new columns for it: `is_issue`, `quality_score`, and `imputed_value`, which have been extracted from the CSV files returned by Data Inspector.

In [20]:
def inspect_column(column_name):
    is_issue_col = is_issue[column_name].rename(f"{column_name}_is_issue")
    quality_score_col = quality_score[column_name].rename(f"{column_name}_quality_score")
    imputed_values_col = imputed_values[column_name].rename(f"{column_name}_imputed_value")

    merged_df = pd.concat([original_dataset, is_issue_col, quality_score_col, imputed_values_col], axis=1)
    return merged_df

First, we will take a closer look at the `letter_grade` column. Here we obtain the concatenated DataFrame using the function defined above, then filter for the examples where `is_issue` is `True` and sort the values by `quality_score` to view the top errors. 

In [21]:
letter_grade_df = inspect_column("letter_grade")
letter_grade_issues = letter_grade_df[letter_grade_df["letter_grade_is_issue"] == True].sort_values("letter_grade_quality_score")
letter_grade_issues.head()

stud_ID,name,exam_1,exam_2,exam_3,notes,letter_grade,letter_grade_is_issue,letter_grade_quality_score,letter_grade_imputed_value
0bdad5,Courtney Richardson,62,0,42,"cheated on exam, gets 0pts",B,True,3e-06,F
88e562,Ana Wells,98,80,89,great final presentation +10,C,True,0.002002,A
74676b,Marion Wilkerson,81,100,74,,F,True,0.006147,B
5eef2c,Andy Woods,90,78,81,,A,True,0.00859,B
1803b9,Patrick Stewart,75,0,55,"cheated on exam, gets 0pts",A,True,0.009317,F


The grades for these students do look extremely suspicious. The `imputed_value` column provides a suggested value for each datapoint that has been identified as an error. 

We can view another example of potential data issues. Let's check out the `exam_1` column and similarly view the top errors in that column.

In [22]:
exam_1_df = inspect_column("exam_1")
exam_1_issues = exam_1_df[exam_1_df["exam_1_is_issue"] == True].sort_values("exam_1_quality_score")
exam_1_issues.head()

stud_ID,name,exam_1,exam_2,exam_3,notes,letter_grade,exam_1_is_issue,exam_1_quality_score,exam_1_imputed_value
2caa08,Nicholas Richardson,980,94,90,great participation +10,A,True,4.071307e-11,89
d30e8a,Matthew Fleming,0,89,91,<p><samp>Invalid entry.</p>,A,True,0.1073906,83
b4d929,Timothy Turner,172,70,68,,C,True,0.1155312,91
86bd1a,Bernadette Larson,-20,38,29,,F,True,0.1692855,46
e1dffd,Roosevelt Francis,69,94,95,great participation +10,A,True,0.2287622,123


Similarly, we see that these exam scores are likely erroneous. Repeating this for each column in your dataset is a straightforward way to inspect your data for potential errors.

The best way to ensure that you have high quality data is to manually inspect the datapoints that Data Inspector has identified as having issues (i.e. `is_issue = True`) and correct them. However, that can be very time consuming, so you can also quickly improve your data by replacing the entries that have been flagged as issues with the suggested imputed values.

We demonstrate how to automatically obtain an improved dataset below, where `improved_dataset` will have the exact same rows and columns as your original dataset, but with the erroneous entries replaced with a suggested value.

In [23]:
improved_dataset = original_dataset.copy()

for col in is_issue.columns:
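    # assign the result back: inplace=True on a column selection may not update improved_dataset in recent pandas versions
    improved_dataset[col] = improved_dataset[col].mask(is_issue[col], imputed_values[col])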
    
improved_dataset.head()

stud_ID,name,exam_1,exam_2,exam_3,notes,letter_grade
f48f73,Nicole Carter,53,77,93,,C
0bd4e7,Tammy Myers,81,64,80,great final presentation +10,B
e1795d,Lillian Lucas,74,88,97,,B
cb9d7a,Danielle Graham,61,94,78,,C
9acca4,Wilbur Fleming,48,90,91,,C
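After replacing values automatically, it is worth listing exactly which entries changed so they can be spot-checked. One way to do that, sketched here on three columns of two rows from the tables above, is `DataFrame.compare`. The `toy_` frames are built by hand so the example is self-contained.

```python
import pandas as pd

ids = pd.Index(["f48f73", "0bd4e7"], name="stud_ID")
toy_original = pd.DataFrame(
    {
        "exam_1": [53, 81],
        "notes": [None, "great participation +10"],
        "letter_grade": ["C", "B"],
    },
    index=ids,
)

# Mimic the replacement step for the single entry that changed in the output above.
toy_improved = toy_original.copy()
toy_improved.loc["0bd4e7", "notes"] = "great final presentation +10"

# Keeps only the rows and columns that differ; "self" is the original value, "other" the new one.
changes = toy_original.compare(toy_improved)
print(changes)
```

Applied to the full data, `original_dataset.compare(improved_dataset)` gives the same kind of side-by-side view of every replaced entry.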


## Note about Real-time and Batch Transform

Currently, real-time and batch inference have no effect for Data Inspector, so do not try to use them. All results are returned at the end of the training job.

## Clean Up Resources (Optional)

The code cell below contains code that can be used to automatically delete the files created in S3 by this sample notebook. Be aware that this command will delete all the files stored in the `base_folder_name` folder specified above. **Proceed with caution** if you have previously stored other files in that folder.

Alternatively, you can navigate to the [S3 Management Console](https://s3.console.aws.amazon.com/) and manually delete these data files yourself.

In [24]:
# resource = session.resource('s3')
# resource_bucket = resource.Bucket(bucket_name)

# # before executing the code below, ensure there is nothing else important in this folder
# resource_bucket.objects.filter(Prefix=f"{base_folder_name}/").delete()  

## Additional Support

To ask questions or report problems, please email support@cleanlab.ai and mention in the subject line that you are using Data Inspector in AWS Marketplace.

To run a more in-depth analysis of your data and labels, try [Cleanlab Studio](https://cleanlab.ai/studio/) for free (it supports image data as well).