In [47]:
# @title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the "License")

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# Generate text embeddings by using OpenAI models

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/open_ai_text_embeddings.ipynb"><img src="https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/open_ai_text_embeddings.ipynb"><img src="https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png" />View source on GitHub</a>
  </td>
</table>



Use text embeddings to represent text as numerical vectors. This process lets computers understand and process text data, which is essential for many natural language processing (NLP) tasks.

The following NLP tasks use embeddings:

* **Semantic search:** Find documents or passages that are relevant to a query when the query doesn't use the exact same words as the documents.
* **Text classification:** Categorize text data into different classes, such as spam and not spam, or positive sentiment and negative sentiment.
* **Machine translation:** Translate text from one language to another and preserve the meaning.
* **Text summarization:** Create shorter summaries of text.
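As a rough illustration of how semantic search uses embeddings, the following sketch ranks two documents against a query by cosine similarity. The three-dimensional vectors are made up for this example; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors that stand in for real embeddings (assumption for illustration).
documents = {
    "Beam runs batch and streaming pipelines.": [0.9, 0.1, 0.0],
    "Bananas are rich in potassium.": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]

# Sort the documents from most similar to least similar to the query.
ranked = sorted(
    documents,
    key=lambda text: cosine_similarity(query_vector, documents[text]),
    reverse=True,
)
print(ranked[0])
```

The document whose vector points in nearly the same direction as the query vector ranks first, even though the query and the document share no words in this toy setup.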

This notebook uses Apache Beam's `MLTransform` to generate embeddings from text data using OpenAI's embedding models.

OpenAI provides embedding models such as `text-embedding-3-small` and `text-embedding-3-large`. These models support configurable output dimensions, so you can trade embedding quality against storage and computation costs.

To generate text embeddings using OpenAI models with `MLTransform`, use the `OpenAITextEmbeddings` module to specify the model configuration.


## Install dependencies

Install Apache Beam and the dependencies needed to work with OpenAI embeddings.

In [48]:
! pip install 'apache_beam[interactive,openai]>=2.71.0' --quiet


In [49]:
import logging
logging.getLogger().setLevel(logging.ERROR)

import tempfile
import apache_beam as beam
from apache_beam.ml.transforms.base import MLTransform
from apache_beam.ml.transforms.embeddings.open_ai import OpenAITextEmbeddings

## Set up your OpenAI API key

To use OpenAI's embedding models, you need an API key. You can get one from [OpenAI's platform](https://platform.openai.com/api-keys).

Set the `OPENAI_API_KEY` environment variable before running this notebook.

In [50]:
import os

OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')
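`os.environ.get` returns `None` when the variable is unset, so a missing key would otherwise surface only later, when the pipeline calls the OpenAI API. The following is a minimal sketch of an early check. `require_api_key` is a hypothetical helper name, not part of Beam or the OpenAI client.

```python
import os

def require_api_key(env_var="OPENAI_API_KEY"):
    """Return the value of env_var, or raise a clear error if it is unset or empty."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"Environment variable {env_var} is not set. "
            "Set it before running this notebook."
        )
    return key
```

For example, you could replace the assignment above with `OPENAI_API_KEY = require_api_key()` to fail fast with a readable message.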

## Process the data

`MLTransform` is a `PTransform` that you can use for data preparation, including generating text embeddings.

### Use MLTransform in write mode

In `write` mode, `MLTransform` saves the transforms and their attributes to an artifact location. Then, when you run `MLTransform` in `read` mode, these transforms are used. This process ensures that you're applying the same preprocessing steps when you train your model and when you serve the model in production or test its accuracy.

For more information about using `MLTransform`, see [Preprocess data with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in the Apache Beam documentation.

### Get the data

The following text inputs come from the Hugging Face blog [Getting Started With Embeddings](https://huggingface.co/blog/getting-started-with-embeddings).


`MLTransform` operates on dictionaries of data. To generate embeddings for specific columns, provide the column names as input to the `columns` argument in the `OpenAITextEmbeddings` class.
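If your text starts out as a plain list of strings, you need to wrap each string in a dictionary first. The following sketch shows one way to do that. `to_rows` is a hypothetical helper, and the column name `x` is chosen only to match the rest of this notebook.

```python
def to_rows(texts, column="x"):
    """Wrap each string in a single-column dictionary."""
    return [{column: text} for text in texts]

sentences = [
    "Apache Beam is an open source unified programming model.",
    "You can run Beam pipelines on multiple execution engines.",
]
rows = to_rows(sentences)
print(rows[0])
```

The resulting list has the same shape as the `content` list used in this notebook, so the column name `x` can be passed to the `columns` argument.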

In [51]:
content = [
    {'x': 'Apache Beam is an open source unified programming model.'},
    {'x': 'It allows you to define both batch and streaming data pipelines.'},
    {'x': 'Beam provides a portable API layer for building pipelines.'},
    {'x': 'You can run Beam pipelines on multiple execution engines.'},
    {'x': 'Runners execute pipelines on distributed processing backends.'},
]

# Using text-embedding-3-small model - a cost-effective option.
# Other options: text-embedding-3-large, text-embedding-ada-002
text_embedding_model_name = 'text-embedding-3-small'

# Helper function that returns a new dict containing only the first
# ten elements of each generated embedding. It builds a new dict
# because Beam transforms must not mutate their input elements.
def truncate_embeddings(d):
  return {key: value[:10] for key, value in d.items()}


### Generate text embeddings
This example uses the model `text-embedding-3-small` to generate text embeddings. For more information about OpenAI embedding models, see [OpenAI's embeddings documentation](https://platform.openai.com/docs/guides/embeddings).

The `text-embedding-3-small` model produces 1536-dimensional embeddings by default, but you can use the `dimensions` parameter to reduce this.
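To get a feel for why the number of dimensions matters, the following sketch estimates raw storage for a collection of embeddings. It assumes each value is stored as a 4-byte float and uses a hypothetical collection of one million vectors; actual storage depends on your database and its index overhead.

```python
def embedding_storage_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Estimate raw storage, assuming a fixed number of bytes per value."""
    return num_vectors * dimensions * bytes_per_value

# Hypothetical collection size, chosen only for illustration.
num_vectors = 1_000_000

full = embedding_storage_bytes(num_vectors, 1536)
reduced = embedding_storage_bytes(num_vectors, 256)

print(f"1536 dimensions: {full / 1e9:.3f} GB")
print(f"256 dimensions:  {reduced / 1e9:.3f} GB")
```

Under these assumptions, reducing the dimensions from 1536 to 256 cuts raw storage by a factor of six.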

In [52]:
artifact_location = tempfile.mkdtemp(prefix='openai_')
embedding_transform = OpenAITextEmbeddings(
        model_name=text_embedding_model_name,
        columns=['x'],
        api_key=OPENAI_API_KEY)

with beam.Pipeline() as pipeline:
  data_pcoll = (
      pipeline
      | "CreateData" >> beam.Create(content))
  transformed_pcoll = (
      data_pcoll
      | "MLTransform" >> MLTransform(write_artifact_location=artifact_location).with_transform(embedding_transform))

  transformed_pcoll | beam.Map(truncate_embeddings) | 'LogOutput' >> beam.Map(print)

  transformed_pcoll | "PrintEmbeddingShape" >> beam.Map(lambda x: print(f"Embedding shape: {len(x['x'])}"))

Embedding shape: 1536
{'x': [-0.009753434918820858, 0.0038477969355881214, 0.03195132687687874, -0.003576868213713169, 0.024280086159706116, -0.003038055030629039, -0.009223753586411476, -0.004340948071330786, 0.018362272530794144, -0.05050842463970184]}
Embedding shape: 1536
{'x': [0.012837349437177181, -0.024922415614128113, 0.04935948923230171, -0.029872925952076912, 0.007735170423984528, -0.03977394476532936, 0.010453097522258759, 0.015591677278280258, -0.031911369413137436, 0.016720101237297058]}
Embedding shape: 1536
{'x': [0.00837371964007616, -0.02793511003255844, 0.034448761492967606, -0.007256315555423498, 0.01797385886311531, -0.014417242258787155, -0.026722317561507225, -0.022675134241580963, 0.024800928309559822, 0.0005595539114437997]}
Embedding shape: 1536
{'x': [0.018137292936444283, 0.0439058393239975, 0.08075538277626038, -0.06779270619153976, -0.004390583839267492, -0.0013206052826717496, -0.01129007339477539, 0.009728540666401386, -0.036117780953645706, -0.013060680

### Configure embedding dimensions

OpenAI's newer embedding models (`text-embedding-3-small` and `text-embedding-3-large`) support the `dimensions` parameter, which lets you request shorter embeddings at a modest cost in accuracy. Shorter embeddings reduce storage costs and speed up downstream computation, such as similarity search.

For example, you can reduce the dimensions from 1536 to 256:

In [53]:
artifact_location_with_dims = tempfile.mkdtemp(prefix='openai_dims_')

embedding_transform_with_dims = OpenAITextEmbeddings(
        model_name=text_embedding_model_name,
        columns=['x'],
        api_key=OPENAI_API_KEY,
        dimensions=256  # Reduce embedding dimensions
        )

with beam.Pipeline() as pipeline:
  data_pcoll = (
      pipeline
      | "CreateData" >> beam.Create(content))
  transformed_pcoll = (
      data_pcoll
      | "MLTransform" >> MLTransform(write_artifact_location=artifact_location_with_dims).with_transform(embedding_transform_with_dims))

  transformed_pcoll | beam.Map(truncate_embeddings) | 'LogOutput' >> beam.Map(print)

  transformed_pcoll | "PrintEmbeddingShape" >> beam.Map(lambda x: print(f"Embedding shape: {len(x['x'])}"))

Embedding shape: 256
{'x': [-0.020514478906989098, 0.00809310283511877, 0.06720348447561264, -0.00752325588837266, 0.0510685034096241, -0.006389965768903494, -0.019400397315621376, -0.00913035124540329, 0.03862151503562927, -0.10623478144407272]}
Embedding shape: 256
{'x': [0.025633791461586952, -0.049765415489673615, 0.09856168925762177, -0.05965065583586693, 0.015445691533386707, -0.07942114025354385, 0.02087288349866867, 0.03113366849720478, -0.06372105330228806, 0.033386923372745514]}
Embedding shape: 256
{'x': [0.017627887427806854, -0.058807432651519775, 0.07251960784196854, -0.015275589190423489, 0.03783756121993065, -0.030350372195243835, -0.05625433102250099, -0.047734424471855164, 0.0522095263004303, 0.0011779415654018521]}
Embedding shape: 256
{'x': [0.03754354268312454, 0.09088350087404251, 0.1671607345342636, -0.14032845199108124, -0.009088350459933281, -0.0027336054481565952, -0.0233700443059206, 0.02013772912323475, -0.0747625008225441, -0.02703513763844967]}
Embedding s
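In the printed values above, each 256-dimensional vector looks like the start of the corresponding 1536-dimensional vector multiplied by a constant factor. That pattern is what you would expect if the shortened embedding were a truncated prefix rescaled to unit length. The following sketch shows that operation in plain Python and checks the ratio for the first output row. `shorten_embedding` is a hypothetical helper, and whether the API performs exactly this step internally is an assumption.

```python
import math

def shorten_embedding(vector, dimensions):
    """Keep the first `dimensions` values and rescale them to unit L2 norm."""
    prefix = vector[:dimensions]
    norm = math.sqrt(sum(v * v for v in prefix))
    if norm == 0:
        raise ValueError("Cannot normalize a zero vector.")
    return [v / norm for v in prefix]

# First three printed values of the first 1536-dimensional output
# and of the first 256-dimensional output shown above.
full_prefix = [-0.009753434918820858, 0.0038477969355881214, 0.03195132687687874]
short_prefix = [-0.020514478906989098, 0.00809310283511877, 0.06720348447561264]

# Each ratio is about 2.10, so the shorter vector is a rescaled prefix.
ratios = [s / f for s, f in zip(short_prefix, full_prefix)]
print(ratios)
```

If you already store full-length embeddings, a helper like this lets you experiment with shorter vectors locally before you change the `dimensions` argument in the pipeline.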

### Use MLTransform in read mode

In `read` mode, `MLTransform` uses the artifacts generated during `write` mode. In this case, the `OpenAITextEmbeddings` transform and its attributes are loaded from the saved artifacts. You don't need to specify the transform again in `read` mode; you only pass the artifact location through `read_artifact_location`.

In this way, `MLTransform` provides consistent preprocessing steps for training and inference workloads.

In [54]:
test_content = [
    {'x': 'What runners does Apache Beam support?'},
    {'x': 'How do I create a streaming pipeline?'},
    {'x': 'A PCollection represents a distributed dataset.'},
]

# Uses the saved artifacts to generate text embeddings.
with beam.Pipeline() as pipeline:
  data_pcoll = (
      pipeline
      | "CreateData" >> beam.Create(test_content))
  transformed_pcoll = (
      data_pcoll
      | "MLTransform" >> MLTransform(read_artifact_location=artifact_location))

  transformed_pcoll | beam.Map(truncate_embeddings) | 'LogOutput' >> beam.Map(print)

{'x': [0.00047562166582793, 0.02407737262547016, 0.10070080310106277, -0.030040575191378593, 0.011932645924389362, -0.030789095908403397, 0.04226639121770859, 0.033109504729509354, 0.021158145740628242, -0.01302423607558012]}
{'x': [0.007936494424939156, -0.01662219502031803, 0.03890787065029144, -0.03299042209982872, -0.011180933564901352, -0.041066598147153854, 0.04319992661476135, -0.009568237699568272, -0.03382851555943489, 0.03431105241179466]}
{'x': [0.022309930995106697, -0.03803618252277374, 0.04467129707336426, 0.023711536079645157, 0.014851856976747513, 0.0012834640219807625, -0.00548747181892395, -0.005561409518122673, -0.023698676377534866, -0.018503742292523384]}


## Next steps

Now that you've generated embeddings, you can use `MLTransform` and sinks to ingest your data into a vector database. For this workflow and more advanced concepts, see the following notebooks:

- [Vector Embedding Ingestion with Apache Beam and AlloyDB](https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/alloydb_product_catalog_embeddings.ipynb)
- [Embedding Ingestion and Vector Search with Apache Beam and BigQuery](https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/bigquery_vector_ingestion_and_search.ipynb)
- [Vector Embedding Ingestion with Apache Beam and CloudSQL Postgres](https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/cloudsql_postgres_product_catalog_embeddings.ipynb)