# Spark and OpenAI Integration Example

This notebook demonstrates how to integrate Apache Spark with OpenAI's API, through Spark UDFs provided by the `openaivec` package, to perform token counting, embedding generation, and multilingual translation.

In [None]:
# Initialize Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

## Create Dummy Data

Create a simple DataFrame containing names of fruits.

In [None]:
# Create DataFrame with fruit names
fruit_data = [("apple",), ("banana",), ("cherry",), ("mango",), ("orange",), ("peach",), ("pear",), ("pineapple",), ("plum",), ("strawberry",)]
df = spark.createDataFrame(fruit_data, ["name"])
df.createOrReplaceTempView("fruits")

In [None]:
# Display the fruits DataFrame
spark.sql("select * from fruits").show()


+----------+
|      name|
+----------+
|     apple|
|    banana|
|    cherry|
|     mango|
|    orange|
|     peach|
|      pear|
| pineapple|
|      plum|
|strawberry|
+----------+



## Count Tokens

Count the number of tokens in each fruit name, as measured for the `gpt-4o` model.

In [None]:
# Register a UDF that counts tokens for the gpt-4o model
from openaivec.spark import count_tokens_udf

spark.udf.register("count_tokens", count_tokens_udf("gpt-4o"))

In [None]:
# Show token counts for each fruit name
spark.sql("""
    select
        name,
        count_tokens(name) as token_count
    from fruits
""").show()


+----------+-----------+
|      name|token_count|
+----------+-----------+
|     apple|          1|
|    banana|          1|
|    cherry|          2|
|     mango|          2|
|    orange|          1|
|     peach|          2|
|      pear|          1|
| pineapple|          2|
|      plum|          2|
|strawberry|          3|
+----------+-----------+




## Generate Embeddings

Generate embeddings for each fruit name using OpenAI's embedding model.

In [None]:
# Register UDF to generate embeddings
import os
from openaivec.spark import UDFBuilder

udf = UDFBuilder.of_openai(
    api_key=os.getenv("OPENAI_API_KEY"),
    model_name="text-embedding-3-small"
)

spark.udf.register("embed", udf.embedding())
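`os.getenv` returns `None` when the variable is missing, so an unset key would be passed along silently and might only surface later, when the UDF first calls the API. The following is a minimal sketch of a fail-fast check that could run before the builder is created. `require_env` is a hypothetical helper introduced here for illustration; it is not part of `openaivec`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or blank.

    Hypothetical helper for illustration; not part of openaivec.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; export it before starting Spark."
        )
    return value


# Possible usage (left commented out so the sketch runs without a key):
# api_key = require_env("OPENAI_API_KEY")
```

Failing on the driver at registration time tends to give a clearer error than a failure inside a Spark task.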

In [None]:
# Display embeddings for each fruit name
spark.sql("""
    select
        name,
        embed(name) as embedding
    from fruits
""").show()



+----------+--------------------+
|      name|           embedding|
+----------+--------------------+
|     apple|[0.017625168, -0....|
|    banana|[0.008520956, -0....|
|    cherry|[0.036192853, -0....|
|     mango|[0.055479053, -0....|
|    orange|[-0.025900742, -0...|
|     peach|[0.030666627, -0....|
|      pear|[0.023707643, -0....|
| pineapple|[0.021006476, -0....|
|      plum|[0.004915137, 5.4...|
|strawberry|[0.020135976, -0....|
+----------+--------------------+
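Embedding vectors are usually compared with cosine similarity. The sketch below shows that comparison in plain Python. The three-dimensional vectors are invented for illustration and are not real model output. `cosine_similarity` and `most_similar` are hypothetical helper names, not functions from `openaivec` or Spark.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def most_similar(name, vectors):
    """Return the key whose vector is closest to vectors[name], excluding name itself."""
    target = vectors[name]
    candidates = [k for k in vectors if k != name]
    return max(candidates, key=lambda k: cosine_similarity(target, vectors[k]))


# Toy vectors, made up for this example (real embeddings have many more dimensions).
toy_vectors = {
    "apple": [0.9, 0.1, 0.0],
    "pear": [0.8, 0.2, 0.1],
    "mango": [0.1, 0.9, 0.3],
}

print(most_similar("apple", toy_vectors))
```

With real data, the `name` and `embedding` columns could be collected to the driver and passed in as a dictionary of the same shape. For large tables, computing the similarity inside Spark would be the more scalable choice.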




## Multilingual Translation

Translate fruit names into eight languages using the `gpt-4o-mini` model, with a Pydantic model describing the structured response.

In [None]:
# Register UDF for multilingual translation
import os
from pydantic import BaseModel
from openaivec.spark import UDFBuilder

udf = UDFBuilder.of_openai(
    api_key=os.getenv("OPENAI_API_KEY"),
    model_name="gpt-4o-mini",
)

class Translation(BaseModel):
    en: str
    fr: str
    ja: str
    es: str
    de: str
    it: str
    pt: str
    ru: str

spark.udf.register("translate", udf.completion(
    system_message="Translate the following text to English, French, Japanese, Spanish, German, Italian, Portuguese, and Russian.",
    response_format=Translation,
))
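The query in the next cell reads fields such as `t.en`, which implies that the UDF returns a Spark struct with one string field per attribute of `Translation`. How `openaivec` derives that schema is not shown in this notebook. The sketch below is only an assumed illustration of such a mapping: it uses a standard-library dataclass as a stand-in for the Pydantic model, and `to_struct_ddl` is a hypothetical helper.

```python
from dataclasses import dataclass, fields


@dataclass
class TranslationFields:
    """Stand-in for the Pydantic Translation model, with the same field names."""
    en: str
    fr: str
    ja: str
    es: str
    de: str
    it: str
    pt: str
    ru: str


# Assumed mapping from Python annotations to Spark SQL type names.
SPARK_TYPE_NAMES = {str: "string", int: "bigint", float: "double", bool: "boolean"}


def to_struct_ddl(cls) -> str:
    """Build a Spark DDL struct type string from a dataclass's annotated fields."""
    parts = [f"{f.name}:{SPARK_TYPE_NAMES[f.type]}" for f in fields(cls)]
    return "struct<" + ",".join(parts) + ">"


print(to_struct_ddl(TranslationFields))
```

A struct return type of this shape is what makes dotted field access like `t.en` valid in Spark SQL.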

In [None]:
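# Display translations for each fruit name.
# Note: referencing the alias `t` in the same select list relies on
# lateral column alias support (Spark 3.4 or later).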
spark.sql("""
    select
        name,
        translate(name) as t,
        t.en as en,
        t.fr as fr,
        t.ja as ja,
        t.es as es,
        t.de as de,
        t.it as it,
        t.pt as pt,
        t.ru as ru
    from fruits
""").show()



+----------+-----------------------+----------+------+------------+-------+--------+--------+-------+--------+
|      name|                      t|        en|    fr|          ja|     es|      de|      it|     pt|      ru|
+----------+-----------------------+----------+------+------------+-------+--------+--------+-------+--------+
|     apple| {apple, pomme, リン...|     apple| pomme|      リンゴ|manzana|   Apfel|    mela|   maçã|  яблоко|
|    banana|   {banana, banane, ...|    banana|banane|      バナナ|plátano|  Banane|  banana| banana|   банан|
|    cherry|   {cherry, cerise, ...|    cherry|cerise|  さくらんぼ| cereza| Kirsche|ciliegia| cereja|   вишня|
|     mango|  {mango, mangue, マ...|     mango|mangue|    マンゴー|  mango|   Mango|   mango|  manga|   манго|
|    orange|   {orange, orange, ...|    orange|orange|    オレンジ|naranja|  Orange| arancia|laranja|апельсин|
|     peach|  {peach, pêche, 桃,...|     peach| pêche|          桃|durazno|Pfirsich|   pesca|pêssego|  персик|
|      pear|  {pear, poir


## Conclusion

This notebook showed how to register OpenAI-backed Spark UDFs and call them from Spark SQL for three tasks: token counting, embedding generation, and multilingual translation.