# Use BigQuery DataFrames to run Anthropic LLM at scale

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/anniexudan/bqtoclaude/blob/main/Python_Notebook_Sample/BigFrames%2BClaude_Remote_Function.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/anniexudan/bqtoclaude/blob/main/Python_Notebook_Sample/BigFrames%2BClaude_Remote_Function.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>                                                                                               
</table>

# Overview

Anthropic Claude models are available as APIs on Vertex AI ([docs](https://cloud.google.com/vertex-ai/generative-ai/docs/partner-models/use-claude)).

To run Claude models over large datasets, we can use BigQuery DataFrames
remote functions ([docs](https://cloud.google.com/bigquery/docs/use-bigquery-dataframes#remote-functions)).
BigQuery DataFrames provides a simple Pythonic interface, `remote_function`, that
deploys user code as a BigQuery remote function and then invokes it at scale
using the parallel, distributed computing architecture of BigQuery and
Google Cloud Functions.

This notebook walks through one such example. For demonstration purposes we
use a small amount of data, but the example generalizes to large data. See the
I/O APIs provided by BigQuery DataFrames [here](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.pandas#bigframes_pandas_read_gbq)
for ways to create a DataFrame from data stored in a BigQuery table or a GCS
bucket.

# Set Up

## Set up a Claude model in Vertex AI

Follow the "Before you begin" steps in the Vertex AI documentation:
https://cloud.google.com/vertex-ai/generative-ai/docs/partner-models/use-claude#before_you_begin

## Install Anthropic with Vertex if needed

Uncomment and run the following cell to install the `anthropic` Python package
with the `vertex` extra if you don't already have it.

In [6]:
# !pip install "anthropic[vertex]" --quiet

## Define project and location for GCP integration

In [1]:
PROJECT = ""   # Google Cloud project ID
LOCATION = ""  # Location used for both BigQuery and the Vertex AI Claude endpoint
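
Both values above are empty placeholders and must be filled in before the later cells will work. As a minimal sketch, a small guard like the one below can catch a forgotten placeholder early. The helper name `check_gcp_config` and the sample values are hypothetical and are not part of the original notebook.

```python
def check_gcp_config(project: str, location: str) -> None:
    """Raise ValueError if either setting is still an empty placeholder."""
    settings = (("PROJECT", project), ("LOCATION", location))
    missing = [name for name, value in settings if not value.strip()]
    if missing:
        raise ValueError(f"Set {', '.join(missing)} before continuing")


# Hypothetical values for illustration only; substitute your own.
check_gcp_config("my-project-id", "us-east5")
```

Because the notebook passes `LOCATION` to both BigQuery DataFrames and the Anthropic client, the value chosen should be valid for both services.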

# Initialize a BigQuery DataFrames DataFrame

BigQuery DataFrames is a set of open source Python libraries that let you take
advantage of BigQuery data processing by using familiar Python APIs.
For more details, see https://cloud.google.com/bigquery/docs/bigquery-dataframes-introduction.

In [2]:
# Import BigQuery DataFrames pandas module and initialize it with your project
# and location

import bigframes.pandas as bpd
bpd.options.bigquery.project = PROJECT
bpd.options.bigquery.location = LOCATION

Let's use a DataFrame with a small amount of inline data for demo purposes.
You can also create a DataFrame from your own data; see APIs such as `read_gbq`,
`read_csv`, and `read_json` at https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.pandas.

In [3]:
df = bpd.DataFrame({"questions": [
    "What is the capital of France?",
    "Explain the concept of photosynthesis in simple terms.",
    "Write a haiku about artificial intelligence.",
]})
df

|   | questions |
|---|-----------|
| 0 | What is the capital of France? |
| 1 | Explain the concept of photosynthesis in simpl... |
| 2 | Write a haiku about artificial intelligence. |
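
For prompts that already live in a BigQuery table, a sketch along the following lines could replace the inline data. It uses `bpd.read_gbq`, which the text mentions. The helper name `load_questions`, the table ID in the comment, and the assumption that the table has a `questions` column are all hypothetical.

```python
def load_questions(table_id: str, column: str = "questions"):
    """Read a BigQuery table and keep only the prompt column (a sketch)."""
    # Local import keeps the helper self-contained.
    import bigframes.pandas as bpd

    return bpd.read_gbq(table_id)[[column]]


# Hypothetical table ID; requires PROJECT and LOCATION to be configured first.
# df = load_questions("my-project.my_dataset.my_prompts")
```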


# Use BigQuery DataFrames `remote_function`

Let's create a remote function from a custom Python function that takes a prompt
and returns the output of the Claude LLM running in Vertex AI. We use
`max_batching_rows=1` to control parallelization: each batch sent to the
underlying cloud function contains a single prompt, so that batch processing
does not time out. The ideal value for `max_batching_rows` depends on the
complexity of the prompts in the real use case and should be found through
offline experimentation. See the [API reference](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.pandas#bigframes_pandas_remote_function)
for other ways to control parallelization.
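
As a rough mental model of what `max_batching_rows` bounds, the plain-Python sketch below splits a list of prompts into batches of at most that many rows and counts the resulting calls. It is only an illustration: the helper `split_into_batches` is hypothetical, and BigQuery chooses the actual batch sizes itself, treating the setting as an upper limit.

```python
def split_into_batches(rows, max_batching_rows):
    """Group rows into consecutive batches of at most max_batching_rows rows."""
    if max_batching_rows < 1:
        raise ValueError("max_batching_rows must be at least 1")
    return [
        rows[start:start + max_batching_rows]
        for start in range(0, len(rows), max_batching_rows)
    ]


prompts = [
    "What is the capital of France?",
    "Explain the concept of photosynthesis in simple terms.",
    "Write a haiku about artificial intelligence.",
]

for size in (1, 2):
    batches = split_into_batches(prompts, size)
    print(f"max_batching_rows={size}: {len(batches)} function calls")
```

With a limit of 1, every prompt gets its own call, which keeps each call short at the cost of more invocations.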

In [7]:
@bpd.remote_function(packages=["anthropic[vertex]"], max_batching_rows=1)
def anthropic_transformer(message: str) -> str:
    # Import inside the function so it resolves in the deployed cloud function
    from anthropic import AnthropicVertex

    client = AnthropicVertex(region=LOCATION, project_id=PROJECT)

    # Send the prompt as a single user turn
    response = client.messages.create(
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": message,
            }
        ],
        model="claude-3-5-sonnet@20240620",
    )
    # Return the text of the first content block, or "" if the response is empty
    return response.content[0].text if response.content else ""
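
Before deploying, the request-and-extract logic can be exercised locally without calling Vertex AI. The sketch below is a hypothetical test harness, not part of the original notebook: `FakeClient`, `FakeMessages`, `ask`, and `extract_text` are made-up names, and the stub mimics only the attributes the function above relies on (`messages.create`, `content`, `text`).

```python
from types import SimpleNamespace


def extract_text(response) -> str:
    """Return the text of the first content block, or "" if there is none."""
    return response.content[0].text if response.content else ""


class FakeMessages:
    """Stand-in for client.messages that echoes the prompt back."""

    def create(self, **kwargs):
        prompt = kwargs["messages"][0]["content"]
        return SimpleNamespace(content=[SimpleNamespace(text=f"echo: {prompt}")])


class FakeClient:
    """Stand-in for AnthropicVertex; no network request is made."""

    messages = FakeMessages()


def ask(client, prompt: str, model: str = "claude-3-5-sonnet@20240620") -> str:
    """Send one user turn through the client and return the reply text."""
    response = client.messages.create(
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
        model=model,
    )
    return extract_text(response)


print(ask(FakeClient(), "What is the capital of France?"))
```

Passing the client in as an argument is a design choice for this sketch only; it lets the same logic run against either the stub or a real client.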

In [None]:
# Print the BigQuery remote function created
anthropic_transformer.bigframes_remote_function

In [None]:
# Print the cloud function created
anthropic_transformer.bigframes_cloud_function

In [None]:
# Apply the remote function on the user data
df["answers"] = df["questions"].apply(anthropic_transformer)
df