# Uploading Data from BigQuery to Cleanlab Studio

In this tutorial, you'll learn how to upload data from BigQuery to Cleanlab Studio. You'll start by creating a table in BigQuery, then grant the Cleanlab Studio GCP service account read access to that table. Finally, you'll use the Python client to upload the table to Cleanlab Studio.

This notebook uses the BigQuery client library along with the `cleanlab-studio` Python package.

## 1. Install and import dependencies

You'll need to install the `cleanlab-studio` package, along with the `google-cloud-bigquery` package. Additionally, you will need the `requests` library to download the example dataset.

### 1a. Install the required packages

Install the required packages with `pip`.

In [None]:
%pip install cleanlab-studio --upgrade
%pip install google-cloud-bigquery requests

In [None]:
from cleanlab_studio import Studio
from google.cloud import bigquery
import requests

### 1b. Create BigQuery and Cleanlab Studio clients

To make API calls to BigQuery and Cleanlab Studio, you need to create clients for both services.

This tutorial assumes you have already authenticated your Google Cloud account. If you haven't, you can follow the instructions in the [Google Cloud documentation](https://cloud.google.com/docs/authentication/client-libraries).

Ensure that you set the `GCP_PROJECT` variable along with the Cleanlab Studio API key in the following block.

In [9]:
# create a BigQuery client
GCP_PROJECT = "<your-gcp-project>"
bigquery_client = bigquery.Client(project=GCP_PROJECT)

# create a Studio client
# you can find your Cleanlab Studio API key by going to app.cleanlab.ai/account
API_KEY = "<YOUR_API_KEY>"
studio = Studio(API_KEY)
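
Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. The following is a minimal sketch of reading both values from environment variables instead. The helper `read_setting` and the variable names `GCP_PROJECT` and `CLEANLAB_STUDIO_API_KEY` are assumptions made for this example, not names that either client library looks up on its own.

```python
import os


def read_setting(name: str) -> str:
    """Return the environment variable `name`, failing loudly if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this notebook.")
    return value


# Hypothetical variable names; substitute whatever your environment defines.
# GCP_PROJECT = read_setting("GCP_PROJECT")
# API_KEY = read_setting("CLEANLAB_STUDIO_API_KEY")
```

With the two commented lines enabled, the rest of the cell above can stay unchanged.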

## 2. Upload an example dataset to BigQuery (optional)

You can use the following code to upload an example dataset to BigQuery. The example is a customer support dataset with two columns, `text` and `label`.

If you already have a table in BigQuery, you can skip the two optional cells below. Still set `BIGQUERY_DATASET` and `BIGQUERY_TABLE` to your own values, and define `bigquery_table_id` as `f"{GCP_PROJECT}.{BIGQUERY_DATASET}.{BIGQUERY_TABLE}"`, because step 3 relies on it.

In [6]:
# set the BigQuery dataset and table names
BIGQUERY_DATASET = "cleanlab_studio_demo"
BIGQUERY_TABLE = "cleanlab_studio_banking"

LOCAL_DATASET_PATH = "/tmp/studio_bigquery_dataset.csv"

**Optional: download example dataset**

In [7]:
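# download the example CSV; fail on an HTTP error instead of saving an error page to disk
resp = requests.get("https://cleanlab-public.s3.amazonaws.com/Datasets/banking-text-quickstart-v1.csv", timeout=60)
resp.raise_for_status()
with open(LOCAL_DATASET_PATH, "wb") as outf:
    outf.write(resp.content)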
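
Before loading the file into BigQuery, it can help to confirm that the download really is the CSV you expect. The sketch below reads only the header row with the standard `csv` module and reports any missing columns. The helper names `read_csv_header` and `missing_columns` are made up for this example, and the expected column names `text` and `label` are taken from the schema this tutorial declares for the table.

```python
import csv


def read_csv_header(path: str) -> list[str]:
    """Return the column names from the first row of a CSV file ([] if the file is empty)."""
    with open(path, newline="", encoding="utf-8") as f:
        return next(csv.reader(f), [])


def missing_columns(path: str, expected: list[str]) -> list[str]:
    """Return the names in `expected` that do not appear in the CSV header."""
    header = set(read_csv_header(path))
    return [name for name in expected if name not in header]


# Example use with the path defined earlier in this notebook:
# missing = missing_columns(LOCAL_DATASET_PATH, ["text", "label"])
# if missing:
#     raise ValueError(f"Downloaded file is missing columns: {missing}")
```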

**Optional: upload example dataset to BigQuery**

In [16]:
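# fully qualified IDs: <project>.<dataset> and <project>.<dataset>.<table>
bigquery_dataset_id = f"{GCP_PROJECT}.{BIGQUERY_DATASET}"
bigquery_table_id = f"{bigquery_dataset_id}.{BIGQUERY_TABLE}"

# create the dataset if it does not already exist
bigquery_client.create_dataset(bigquery_dataset_id, exists_ok=True)

# the CSV has a header row, so skip it and declare the two columns explicitly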
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    schema=[
        bigquery.SchemaField("text", "STRING"),
        bigquery.SchemaField("label", "STRING"),
    ],
)

with open(LOCAL_DATASET_PATH, "rb") as source_file:
    job = bigquery_client.load_table_from_file(source_file, bigquery_table_id, job_config=job_config)

# wait for the load job to finish; this raises an exception if the load failed
result = job.result()
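
Once the load job has finished, one quick sanity check is to compare the number of rows BigQuery reports with the number of data rows in the local file. The sketch below assumes `job` is the finished load job from the cell above and reads its `output_rows` property. The helpers `count_csv_data_rows` and `rows_match` are hypothetical names introduced for this example. Treat a mismatch as a prompt to investigate rather than proof of a failed load, since blank lines are ignored here and BigQuery may treat unusual rows differently.

```python
import csv


def count_csv_data_rows(path: str) -> int:
    """Count non-empty rows after the header, honoring quoted newlines."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return sum(1 for row in reader if row)


def rows_match(job, path: str) -> bool:
    """Return True if the load job imported as many rows as the CSV contains."""
    return job.output_rows == count_csv_data_rows(path)


# Example use with names defined earlier in this notebook:
# if not rows_match(job, LOCAL_DATASET_PATH):
#     print("Row count differs between the local CSV and the BigQuery load job.")
```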

## 3. Configure access to BigQuery

To upload data from BigQuery to Cleanlab Studio, the Cleanlab Studio GCP service account must be able to read your BigQuery table. The code below grants it the `roles/bigquery.dataViewer` role by adding a binding to the table's IAM policy, so access is scoped to this one table rather than to the whole project.

This allows Cleanlab Studio to read data from your BigQuery table.

In [17]:
def add_read_access_to_bigquery_table(
    bigquery_client: bigquery.Client,
    bigquery_table_id: str,
    entity_id: str,
):
    """Adds read access to the given entity for the given BigQuery table."""
    # load IAM policy for table
    policy = bigquery_client.get_iam_policy(bigquery_table_id)

    # add dataViewer binding to policy
    policy.bindings.append(
        {
            "role": "roles/bigquery.dataViewer",
            "members": [f"serviceAccount:{entity_id}"],
        }
    )

    # update IAM policy for table
    bigquery_client.set_iam_policy(bigquery_table_id, policy)


cleanlab_studio_entity_id = "cleanlab-studio-bq-integration@cleanlab-studio-433118.iam.gserviceaccount.com"

# grant the Cleanlab Studio service account read access to the BigQuery table
add_read_access_to_bigquery_table(
    bigquery_client=bigquery_client,
    bigquery_table_id=bigquery_table_id,
    entity_id=cleanlab_studio_entity_id,
)
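
The function above appends a new binding every time the cell runs, so re-running the notebook adds duplicate entries to the policy. A minimal sketch of an idempotent alternative follows. It assumes the policy's bindings are a list of dictionaries with `role` and `members` keys, as in the cell above, and it leaves conditional bindings untouched. The helper name `ensure_member` is made up for this example.

```python
def ensure_member(bindings: list, role: str, member: str) -> bool:
    """Add `member` to `role` in `bindings` unless it is already granted.

    Returns True if `bindings` was modified, False if nothing had to change.
    """
    # only consider unconditional bindings for the requested role
    same_role = [b for b in bindings if b.get("role") == role and not b.get("condition")]
    if any(member in b.get("members", ()) for b in same_role):
        return False
    if same_role:
        # reuse the existing binding for this role instead of adding a second one
        same_role[0]["members"] = sorted(set(same_role[0].get("members", ())) | {member})
    else:
        bindings.append({"role": role, "members": [member]})
    return True


# Example use with the clients and IDs defined earlier in this notebook:
# policy = bigquery_client.get_iam_policy(bigquery_table_id)
# member = f"serviceAccount:{cleanlab_studio_entity_id}"
# if ensure_member(policy.bindings, "roles/bigquery.dataViewer", member):
#     bigquery_client.set_iam_policy(bigquery_table_id, policy)
```

Skipping `set_iam_policy` when nothing changed also avoids an unnecessary write to the table's policy.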

## 4. Upload data from BigQuery to Cleanlab Studio

Now that you have created a table in BigQuery and configured access, you can upload the data to Cleanlab Studio. You can use the `cleanlab-studio` Python package to upload the data.

After the upload completes, you can open the dataset in Cleanlab Studio either by finding it on the Dashboard or by following the link that the cell below prints.

In [None]:
# upload the dataset to Cleanlab Studio
dataset_id = studio.upload_from_bigquery(
    bigquery_project=GCP_PROJECT,
    bigquery_dataset_id=BIGQUERY_DATASET,
    bigquery_table_id=BIGQUERY_TABLE,
)

# view the dataset in Cleanlab Studio
print(f"https://app.cleanlab.ai/datasets/{dataset_id}")

## 5. Conclusion

In this tutorial, you learned how to upload data from BigQuery to Cleanlab Studio. You created a table in BigQuery, configured access, and uploaded the data using the `cleanlab-studio` Python package. You can now access your BigQuery data in Cleanlab Studio and use it to create projects. For next steps, check out our [Projects guide](/guide/concepts/projects).