# OpenAI

See:

* [Welcome to the OpenAI platform](https://platform.openai.com/overview)
* [OpenAI Introduction](https://platform.openai.com/docs/introduction/overview)
* [OpenAI Quickstart](https://platform.openai.com/docs/quickstart)
* [Best practices for prompt engineering with OpenAI API](https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api)
* [Usage Dashboard](https://platform.openai.com/account/usage)
* [tiktoken](https://github.com/openai/tiktoken)
* [Billing settings](https://platform.openai.com/account/billing/overview)

See also: 

* [Awesome ChatGPT Prompts](https://github.com/f/awesome-chatgpt-prompts)

In [1]:
%pip install chromadb==0.3.21 tiktoken==0.3.3




[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m23.1.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.10 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd

talks = pd.read_csv("dais23_talks.csv")
talks.head()

| Unnamed: 0 | Title | Abstract |
|---|---|---|
| 0 | Nebula: The Journey of Scaling Instacart’s Dat... | Instacart has gone through immense growth duri... |
| 1 | Satellite Imaginary Data Processing Using Apac... | Agriculture is a complex ecosystem. Understand... |
| 2 | From Snowflake to Enterprise-Scale Apache Spark™ | Akamai mPulse is a real user monitoring (RUM) ... |
| 3 | The Future of Data Orchestration: Asset-Based ... | Data orchestration is a core component for any... |
| 4 | Photon for Dummies: How Does this New Executio... | Did you finish the Photon whitepaper and think... |


In [3]:
talks["full_text"] = talks.apply(
    lambda row: f"""Title: {row["Title"]} Abstract: {row["Abstract"]}""".strip(),
    axis=1,
)
print(talks.iloc[0]["full_text"])

Title: Nebula: The Journey of Scaling Instacart’s Data Pipelines with Apache Spark™ and Lakehouse Abstract: Instacart has gone through immense growth during the pandemic and the trend continues. Instacart ads is no exception in this growth story. We have launched many new product lines including display and video ads covering the full advertising funnel to address the increasing demand of our retail partners. We have built advanced models to auto-suggest optimal bidding to increase the ROI for our CPG partners. Advertisers’ trust is the utmost priority and thus the quest to build a top-class ads measurement platform. Ads data processing requires complex data verifications to update ads serving stats. In ETL pipelines these were implemented through files containing thousands of lines of raw SQL which were hard to scale, test, and iterate upon. Our data engineers used to spend hours testing small changes due to a lack of local testing mechanisms. These pain points stress our need for bet

In [4]:
import chromadb
from chromadb.config import Settings

# See https://docs.trychroma.com/usage-guide#initiating-the-chroma-client
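chroma_client = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",
    # Note: "~" is passed through as-is (see the log line below); wrap the path in
    # os.path.expanduser() if you want it resolved to your home directory.
    persist_directory="~/test/chroma",
))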

Using embedded DuckDB with persistence: data will be stored in: ~/test/chroma


In [7]:
collection_name = "talks"

# If the collection already exists, delete it first so it can be recreated from scratch.
# See https://docs.trychroma.com/usage-guide#using-collections
existing_names = [c.name for c in chroma_client.list_collections()]
if collection_name in existing_names:
    chroma_client.delete_collection(name=collection_name)

print(f"Creating collection: '{collection_name}'")
talks_collection = chroma_client.create_collection(name=collection_name)

No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction


Creating collection: 'talks'




In [9]:
# See https://docs.trychroma.com/usage-guide#adding-data-to-a-collection
talks_list = talks["full_text"].tolist()

talks_collection.add(
    documents=talks_list,
    ids=[f"id{x}" for x in range(len(talks_list))],
)
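
The ids above are built from list positions, so an id can be mapped back to the row it came from. A minimal sketch, assuming the `id<N>` format used in this notebook; the helper name `id_to_row` and the sample ids are hypothetical and only for illustration:

```python
def id_to_row(doc_id: str, prefix: str = "id") -> int:
    """Turn an id such as "id15" back into its integer row position."""
    if not doc_id.startswith(prefix):
        raise ValueError(f"unexpected id format: {doc_id!r}")
    return int(doc_id[len(prefix):])


# Sample ids in the same format as the ones generated above (illustrative only)
hit_ids = ["id15", "id12", "id4"]
rows = [id_to_row(i) for i in hit_ids]
print(rows)  # [15, 12, 4]
```

With the `talks` DataFrame from earlier, `talks.iloc[rows]["Title"]` would then give the untruncated titles for those rows.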

In [10]:
import json

results = talks_collection.query(query_texts=["Spark"], n_results=10)
print(json.dumps(results, indent=4))

{
    "ids": [
        [
            "id15",
            "id12",
            "id4",
            "id105",
            "id160",
            "id2",
            "id174",
            "id99",
            "id125",
            "id176"
        ]
    ],
    "embeddings": null,
    "documents": [
        [
            "Title: Deep Dive into the New Features of Apache Spark\u2122 3.4 Abstract: In 2022, Apache Spark\u2122 was awarded the prestigious SIGMOD Systems Award, because Spark is the de facto standard for data processing. In this talk, we want to share the latest progress in Apache Spark community. With tremendous contribution from the open-source community, Spark 3.4 managed to resolve in excess of 2,400 Jira tickets. We will talk about the major features and improvements in Spark 3.4. The major updates are Spark Connect, numerous PySpark and SQL language features, engine performance enhancements, as well as operational improvements in Spark UX and error handling.",
            "Title: Use

In [11]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
lm_model = AutoModelForCausalLM.from_pretrained(model_id)

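pipe = pipeline(
    "text-generation",
    model=lm_model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    device_map="auto",
    # "hole" truncates the left side of an over-long input so that
    # there is still room in the context window for generation.
    handle_long_generation="hole",
)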

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [12]:
question = "Help me find sessions related to Spark."
context = " ".join(f"#{doc}" for doc in results["documents"][0])
requirement = "Recommend top-5 relevant sessions for me to attend."
prompt_template = f"Relevant context: {context}\n\n The user's question: {question} {requirement}"
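
To try several questions against the same retrieved documents, the cell above can be wrapped in a function. This is only a sketch: `build_prompt` is a hypothetical helper, and it numbers each document (`#1`, `#2`, ...) instead of using a bare `#`, which is a small change from the template above.

```python
from typing import List


def build_prompt(docs: List[str], question: str, requirement: str) -> str:
    """Assemble a prompt from retrieved documents, a question, and a requirement."""
    # Number the documents so the model can tell them apart
    context = " ".join(f"#{i} {doc}" for i, doc in enumerate(docs, start=1))
    return f"Relevant context: {context}\n\n The user's question: {question} {requirement}"


example = build_prompt(
    ["Title: A Abstract: first", "Title: B Abstract: second"],
    "Help me find sessions related to Spark.",
    "Recommend top-5 relevant sessions for me to attend.",
)
print(example)
```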

In [13]:
lm_response = pipe(prompt_template)
print(lm_response[0]["generated_text"])

Token indices sequence length is longer than the specified maximum sequence length for this model (2500 > 1024). Running this sequence through the model will result in indexing errors
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Relevant context: #Title: Deep Dive into the New Features of Apache Spark™ 3.4 Abstract: In 2022, Apache Spark™ was awarded the prestigious SIGMOD Systems Award, because Spark is the de facto standard for data processing. In this talk, we want to share the latest progress in Apache Spark community. With tremendous contribution from the open-source community, Spark 3.4 managed to resolve in excess of 2,400 Jira tickets. We will talk about the major features and improvements in Spark 3.4. The major updates are Spark Connect, numerous PySpark and SQL language features, engine performance enhancements, as well as operational improvements in Spark UX and error handling. #Title: Use Apache Spark™ from Anywhere: Remote connectivity with Spark Connect Abstract: Over the past decade, developers, researchers, and the community at large have successfully built tens of thousands of data applications using Apache Spark™. Since then, use cases and requirements of data applications have evolved. Toda
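
The warning above (2500 > 1024) shows that the prompt is longer than GPT-2's context window. One way to handle this is to keep only as many retrieved documents as fit a token budget. The sketch below is an assumption-laden illustration, not part of the original notebook: `fit_context` is a hypothetical helper, and its default token counter simply splits on whitespace as a crude stand-in for a real tokenizer.

```python
from typing import Callable, List


def fit_context(
    docs: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Keep documents in ranked order until the token budget would be exceeded."""
    kept: List[str] = []
    used = 0
    for doc in docs:
        n = count_tokens(doc)
        if used + n > max_tokens:
            break  # stop at the first document that does not fit
        kept.append(doc)
        used += n
    return kept


# Whitespace "tokens": 3 + 2 fit in a budget of 5, the third document does not
print(fit_context(["a b c", "d e", "f g h i"], max_tokens=5))  # ['a b c', 'd e']
```

With the tokenizer loaded earlier, passing `count_tokens=lambda s: len(tokenizer.encode(s))` would count real GPT-2 tokens. The budget should also leave room for the question and for the tokens the model generates.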

## Use OpenAI models for Q/A

For this section to work, you need to generate an OpenAI API key.

Steps:
1. [Create an account](https://platform.openai.com/signup) on OpenAI.
2. Generate an [OpenAI API key](https://platform.openai.com/account/api-keys).

Note: OpenAI does not have a free option, but it gives you \$5 in credit. Once you have used up that credit, you will need to add a payment method, and you will be [charged per token](https://openai.com/pricing).

**IMPORTANT**: Keep your OpenAI API key to yourself. Anyone who has your key can charge their usage to your account!
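
One way to keep the key out of the notebook itself is to read it from an environment variable and only prompt for it when it is missing. This is a minimal sketch; `load_openai_key` is a hypothetical helper, not part of the `openai` package.

```python
import os
from getpass import getpass


def load_openai_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, prompting (without echo) if it is unset."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed value, so the key does not end up in cell output
        key = getpass(f"Enter {env_var}: ")
    return key


# Usage (commented out so the sketch never blocks on input):
# openai.api_key = load_openai_key()
```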

In [15]:
import os
import openai

os.environ["OPENAI_API_KEY"] = "XXXXXXXXX"
openai.api_key = os.environ["OPENAI_API_KEY"]

In [16]:
import tiktoken

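# Assumed price in USD per 1,000 tokens for gpt-3.5-turbo; check the pricing page for current rates.
price_token = 0.002
encoder = tiktoken.encoding_for_model("gpt-3.5-turbo")
# Counts only the prompt text; any system message and the model's reply add to the real bill.
cost_to_run = len(encoder.encode(prompt_template)) / 1000 * price_token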
print(f"It would take roughly ${round(cost_to_run, 5)} to run this prompt")

It would take roughly $0.0044 to run this prompt


In [17]:
# Note: `openai.ChatCompletion` is the interface of `openai` package versions before 1.0.
gpt35_response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt_template},
    ],
    # 0 makes outputs (nearly) deterministic; higher values (up to 2) make
    # the outputs more random each time you re-run.
    temperature=0,
)

In [18]:
print(gpt35_response.choices[0]["message"]["content"])

Sure, here are the top 5 relevant sessions related to Spark that you may want to attend:

1. Deep Dive into the New Features of Apache Spark™ 3.4
2. Use Apache Spark™ from Anywhere: Remote connectivity with Spark Connect
3. Photon for Dummies: How Does this New Execution Engine Actually Work?
4. How Disney+ uses Amazon Kinesis and Databricks to Deliver Personalized Customer Experience
5. An API for DL Inferencing on Spark

These sessions cover a range of topics related to Spark, including new features and improvements, remote connectivity, execution engine, real-time data processing, and deep learning inferencing.


In [19]:
from IPython.display import Markdown

Markdown(gpt35_response.choices[0]["message"]["content"])

Sure, here are the top 5 relevant sessions related to Spark that you may want to attend:

1. Deep Dive into the New Features of Apache Spark™ 3.4
2. Use Apache Spark™ from Anywhere: Remote connectivity with Spark Connect
3. Photon for Dummies: How Does this New Execution Engine Actually Work?
4. How Disney+ uses Amazon Kinesis and Databricks to Deliver Personalized Customer Experience
5. An API for DL Inferencing on Spark

These sessions cover a range of topics related to Spark, including new features and improvements, remote connectivity, execution engine, real-time data processing, and deep learning inferencing.

In [20]:
gpt35_response["usage"]["total_tokens"]

2355
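
The reported total can be turned into an actual cost in the same way as the earlier estimate. The sketch below assumes the same flat price of $0.002 per 1,000 tokens used above; `cost_from_usage` is a hypothetical helper.

```python
def cost_from_usage(usage: dict, price_per_1k: float = 0.002) -> float:
    """Dollar cost of one call, assuming a single flat price per 1,000 tokens."""
    return usage["total_tokens"] / 1000 * price_per_1k


# 2355 is the total reported above
print(round(cost_from_usage({"total_tokens": 2355}), 5))  # 0.00471
```

This is slightly more than the $0.0044 estimate, because the total also includes the system message, the chat formatting overhead, and the tokens in the model's reply.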