This is a demo of the Data Enrichment preview feature. Feedback is welcome via Slack:
@Aaron Zhang @Eytan

In [1]:
######## generate some data  ###################
import pandas as pd
import numpy as np

# Number of rows to generate
num_rows = 20

# Generate random meaningful data
np.random.seed(42)  # For reproducibility
names = ['John', 'Emma', 'Michael', 'Sophia', 'Daniel', 'Olivia', 'Matthew', 'Ava', 'James', 'Isabella',
         'Henry', 'Mia', 'Alexander', 'Charlotte', 'William', 'Amelia', 'Benjamin', 'Harper', 'Lucas', 'Evelyn']
ages = np.random.randint(22, 60, size=num_rows)
countries = ['USA', 'UK', 'Canada', 'Australia', 'Germany', 'France', 'Spain', 'Italy', 'Netherlands', 'Sweden',
             'Norway', 'Denmark', 'Finland', 'Switzerland', 'Ireland', 'Belgium', 'Austria', 'Portugal', 'Greece', 'Poland']
salaries = np.random.randint(30000, 120000, size=num_rows)

# Create the dataframe
data = {
    'Name': names,
    'Age': ages,
    'Country': np.random.choice(countries, num_rows),  # sampled with replacement, so countries can repeat
    'Salary': salaries
}
df = pd.DataFrame(data)

# Display the dataframe
df


,Name,Age,Country,Salary
0,John,50,Netherlands,83707
1,Emma,36,Spain,115305
2,Michael,29,Portugal,58693
3,Sophia,42,Australia,101932
4,Daniel,40,Switzerland,55658
5,Olivia,44,Portugal,114478
6,Matthew,32,Netherlands,48431
7,Ava,32,UK,32747
8,James,45,Poland,89150
9,Isabella,57,Ireland,95725


In [2]:
# initialize the Cleanlab Studio client
from cleanlab_studio import Studio
studio = Studio("<YOUR_API_KEY>")  # replace the placeholder with your own API key

In [3]:
# upload the data
dataset_id: str = studio.upload_dataset(df, "name_age_country_salary")


Uploading dataset...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|
Ingesting Dataset...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|


In [4]:
# create enrichment project
enrichment_project = studio.create_enrichment_project(name="aaron_enrichment_preview_demo", dataset_id=dataset_id)

enrichment_project.id

'cd15cca0956043678812148a58a36d51'

In [7]:
# an existing project can also be fetched again by its ID
ep = studio.get_enrichment_project(enrichment_project.id)
print(ep.id)

# construct preview inputs
from cleanlab_studio.studio.enrichment import EnrichmentOptions

# The same EnrichmentOptions object works for both Preview and Enrich_All: once the preview
# result looks right, reuse it to enrich the entire dataset.
enrichment_options = EnrichmentOptions(
    prompt="Is ${Country} a part of Europe?",
    constrain_outputs=["Yes", "No"],
    quality_preset="low",
    # regex = ...
    # tlm_options = ...
)

cd15cca0956043678812148a58a36d51
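
The `${Country}` placeholder in the prompt refers to a column of the dataset. How the backend fills it in is not shown in this demo; as a rough local illustration only, Python's standard `string.Template` uses the same `${name}` syntax and can render one prompt per row. The `rows` list below is a hypothetical stand-in for two dataset rows.

```python
from string import Template

# Hypothetical stand-in for two dataset rows (values taken from the table above).
rows = [
    {"Name": "Sophia", "Country": "Australia"},
    {"Name": "Ava", "Country": "UK"},
]

# Same placeholder syntax as the prompt passed to EnrichmentOptions.
prompt_template = Template("Is ${Country} a part of Europe?")

# substitute() ignores unused keys and raises KeyError if a referenced column is missing.
prompts = [prompt_template.substitute(row) for row in rows]

for prompt in prompts:
    print(prompt)
```

Rendering a few prompts locally like this is a cheap way to catch a misspelled column name before spending preview calls on it.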


In [8]:
# the magic is about to happen!
# If indices is not set manually, the backend picks 3 random rows.
# The random seed is fixed, so the same rows are picked every time.
indices = [3, 7, 11, 17, 19]
preview_result = enrichment_project.preview(options=enrichment_options, new_column_name="Is_in_Europe", indices=indices)
preview_result

<cleanlab_studio.studio.enrichment.EnrichmentPreviewResult at 0x10e46a680>

In [9]:
preview_result.details()

row_id,Is_in_Europe,Is_in_Europe_trustworthiness_score,Is_in_Europe_raw,Is_in_Europe_log
3,No,0.847086,"No, Australia is not a part of Europe. It is a...",
7,Yes,0.922883,"Yes, the United Kingdom (UK) is a part of Euro...",
11,Yes,0.929441,"Yes, Denmark is a part of Europe. It is locate...",
17,No,0.846957,"No, Australia is not a part of Europe. It is a...",
19,Yes,0.928162,"Yes, Italy is a part of Europe. It is located ...",
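
One possible use of the trustworthiness score is to flag low-scoring rows for manual review. The sketch below rebuilds the relevant columns of the output above by hand, assuming `details()` returns a pandas DataFrame indexed by `row_id`. The 0.9 cutoff is an arbitrary, hypothetical value, not a recommended setting.

```python
import pandas as pd

# Values copied from the details() output above.
details = pd.DataFrame(
    {
        "Is_in_Europe": ["No", "Yes", "Yes", "No", "Yes"],
        "Is_in_Europe_trustworthiness_score": [
            0.847086, 0.922883, 0.929441, 0.846957, 0.928162,
        ],
    },
    index=pd.Index([3, 7, 11, 17, 19], name="row_id"),
)

REVIEW_THRESHOLD = 0.9  # hypothetical cutoff, chosen only for illustration

# Keep the rows whose score falls below the cutoff.
needs_review = details[
    details["Is_in_Europe_trustworthiness_score"] < REVIEW_THRESHOLD
]
review_ids = needs_review.index.tolist()
print(review_ids)
```

With these numbers, the two "No" answers for Australia (rows 3 and 17) fall below the cutoff.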


In [10]:
# join with original data
preview_result.join(df)

,Name,Age,Country,Salary,Is_in_Europe
3,Sophia,42,Australia,101932,No
7,Ava,32,UK,32747,Yes
11,Mia,24,Denmark,65773,Yes
17,Harper,23,Australia,99092,No
19,Evelyn,54,Italy,71606,Yes
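
Judging from the output above, `join` keeps only the previewed rows and matches them to the original data by row index. Under that assumption, the result resembles an inner join on the index in plain pandas. This is a sketch of the observed behavior, not the library's implementation; `df_small` and `details` are hand-built stand-ins.

```python
import pandas as pd

# Hand-built stand-in for part of the original data (rows 0 and 1 were not previewed).
df_small = pd.DataFrame(
    {
        "Name": ["John", "Emma", "Sophia", "Ava"],
        "Country": ["Netherlands", "Spain", "Australia", "UK"],
    },
    index=[0, 1, 3, 7],
)

# Hand-built stand-in for the enriched column of the preview result.
details = pd.DataFrame(
    {"Is_in_Europe": ["No", "Yes"]},
    index=pd.Index([3, 7], name="row_id"),
)

# The inner join keeps only rows present in both frames, in df_small's order.
joined = df_small.join(details, how="inner")
print(joined)
```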


In [11]:
# join with the original data, including the detail columns
preview_result.join(df, with_details=True)

,Name,Age,Country,Salary,Is_in_Europe,Is_in_Europe_trustworthiness_score,Is_in_Europe_raw,Is_in_Europe_log
3,Sophia,42,Australia,101932,No,0.847086,"No, Australia is not a part of Europe. It is a...",
7,Ava,32,UK,32747,Yes,0.922883,"Yes, the United Kingdom (UK) is a part of Euro...",
11,Mia,24,Denmark,65773,Yes,0.929441,"Yes, Denmark is a part of Europe. It is locate...",
17,Harper,23,Australia,99092,No,0.846957,"No, Australia is not a part of Europe. It is a...",
19,Evelyn,54,Italy,71606,Yes,0.928162,"Yes, Italy is a part of Europe. It is located ...",


In [12]:
# one more thing: check the status of the preview jobs
preview_result.get_preview_status()
# We would love to get your thoughts on this API.

{'is_timeout': False, 'completed_jobs_count': 5, 'failed_jobs_count': 0}

### Explanation of status fields:
`is_timeout`: The backend enforces a 2-minute timeout, both to protect the backend and to help customers. If any query to the LLM takes longer than 2 minutes, the preview stops hard and returns whatever results it has already obtained. *NOTE*: Queries to TLM run in parallel, so processing any one row takes up to 2 minutes.

`completed_jobs_count`: Number of completed jobs.

`failed_jobs_count`: Number of failed jobs. A job fails when an uncaught backend error prevents any return for that row (not even an empty result). This count is surfaced so customers know why a result might be missing. For example, if 5 rows were picked but `details()` returned only 4, one job failed on the backend.
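
The status fields can be combined with a missing-row check in a few lines. The helper below is hypothetical and not part of `cleanlab_studio`. It takes the status dict plus the requested and returned row IDs. The second status dict is made up to illustrate a failure.

```python
from typing import Any, Dict, List


def summarize_preview_status(
    status: Dict[str, Any],
    requested_rows: List[int],
    returned_rows: List[int],
) -> Dict[str, Any]:
    """Hypothetical helper: combine the status dict with a missing-row check."""
    missing = sorted(set(requested_rows) - set(returned_rows))
    timed_out = bool(status.get("is_timeout", False))
    failed = int(status.get("failed_jobs_count", 0))
    return {
        "timed_out": timed_out,
        "failed_jobs": failed,
        "missing_rows": missing,
        "all_ok": not timed_out and failed == 0 and not missing,
    }


# Status dict copied from the output above: everything succeeded.
ok = summarize_preview_status(
    {"is_timeout": False, "completed_jobs_count": 5, "failed_jobs_count": 0},
    requested_rows=[3, 7, 11, 17, 19],
    returned_rows=[3, 7, 11, 17, 19],
)

# Made-up status for illustration: one job failed, so row 17 has no result.
bad = summarize_preview_status(
    {"is_timeout": False, "completed_jobs_count": 4, "failed_jobs_count": 1},
    requested_rows=[3, 7, 11, 17, 19],
    returned_rows=[3, 7, 11, 19],
)

print(ok["all_ok"], bad["missing_rows"])
```

In the notebook, `returned_rows` could come from the index of `preview_result.details()`, assuming that method returns a DataFrame indexed by `row_id` as the output above suggests.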

More documentation is needed, but I hope this gives you a general idea.