# Step 0: Install prerequisites and set up the OpenAI API key


In [1]:
pip install openai tiktoken chromadb langchain BeautifulSoup4

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os

# Placeholder value: substitute your own key, and avoid committing a real key to version control.
os.environ["OPENAI_API_KEY"] = "sk-xxxxxxxxxxxxxxxxxxxxxxxx"

# Step 1: Load the document content

In [3]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://bair.berkeley.edu/blog/2023/07/14/ddpo/")
data = loader.load()
data[0].page_content

'\n\n\n\n\nTraining Diffusion Models with  Reinforcement Learning – The Berkeley Artificial Intelligence Research Blog\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSubscribe\nAbout\nArchive\nBAIR\n\n\n\n\n\n\n\n\nTraining Diffusion Models with  Reinforcement Learning\n\nKevin Black \xa0\xa0\n  \n  \n  Jul 14, 2023\n  \n  \n\n\n\n\n\n\n\n\n\n\nTraining Diffusion Models with Reinforcement Learning\n\n\n\n\n\n\nreplay\n\n\nDiffusion models have recently emerged as the de facto standard for generating complex, high-dimensional outputs. You may know them for their ability to produce stunning AI art and hyper-realistic synthetic images, but they have also found success in other applications such as drug design\xa0and continuous control. The key idea behind diffusion models is to iteratively transform random noise into a sample, such as an image or protein structure. This is typically motivated as a maximum likelihood estimation\xa0problem, where the model is train

# Step 2: Split the document into fixed-size chunks

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)
print(all_splits[1])

page_content='Diffusion models have recently emerged as the de facto standard for generating complex, high-dimensional outputs. You may know them for their ability to produce stunning AI art and hyper-realistic synthetic images, but they have also found success in other applications such as drug design\xa0and continuous control. The key idea behind diffusion models is to iteratively transform random noise into a sample, such as an image or protein structure. This is typically motivated as a maximum likelihood' metadata={'source': 'https://bair.berkeley.edu/blog/2023/07/14/ddpo/', 'title': 'Training Diffusion Models with  Reinforcement Learning – The Berkeley Artificial Intelligence Research Blog', 'description': 'The BAIR Blog', 'language': 'No language found.'}
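
To build intuition for `chunk_size` and `chunk_overlap`, here is a minimal sketch of a plain sliding-window splitter. It is not LangChain's implementation: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line boundaries, which this sketch ignores. The function name `split_fixed` is hypothetical.

```python
from typing import List


def split_fixed(text: str, chunk_size: int = 500, chunk_overlap: int = 0) -> List[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. This is a
    simplified illustration, not LangChain's splitting logic.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


no_overlap = split_fixed("abcdefghij", chunk_size=4, chunk_overlap=0)
with_overlap = split_fixed("abcdefghij", chunk_size=4, chunk_overlap=2)
print(no_overlap)
print(with_overlap)
```

With `chunk_overlap=0`, as in the cell above, every character lands in exactly one chunk. A nonzero overlap repeats the tail of each chunk at the head of the next, which can help when an answer straddles a chunk boundary.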


# Step 3: Store the document chunks as vector embeddings

In [5]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

# Step 4: Query the vector store for related chunks based on similarity search

In [6]:
question = "What steps are used for the DDPO algorithm?"
docs = vectorstore.similarity_search(question)
print(f"Retrieved {len(docs)} documents")
print(docs[0].page_content)

Retrieved 4 documents
The key insight of our algorithm, which we call denoising diffusion policy optimization (DDPO), is that we can better maximize the reward of the final sample if we pay attention to the entire sequence of denoising steps that got us there. To do this, we reframe the diffusion process as a multi-step Markov decision process (MDP). In MDP terminology: each denoising step is an action, and the agent only gets a reward on the final step of each denoising trajectory when the final sample is produced.
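
Conceptually, similarity search embeds the question, scores it against every stored chunk vector, and returns the top few matches (four in the output above). The sketch below illustrates that ranking with cosine similarity over toy 2-D vectors. It is an assumption-laden illustration: the metric Chroma actually applies depends on how the collection is configured, real embeddings have far more dimensions, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          chunk_vecs: Sequence[Sequence[float]],
          k: int = 4) -> List[Tuple[int, float]]:
    """Return (chunk index, score) pairs for the k highest-scoring chunks."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for chunk embeddings (made up for illustration).
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
toy_query = [1.0, 0.1]

ranked = top_k(toy_query, toy_chunks, k=2)
print([index for index, _ in ranked])
```

The chunk pointing in nearly the same direction as the query ranks first, and the one pointing the opposite way ranks last.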


# Step 5: Pass the retrieved chunks through an LLM to generate a response

In [7]:
from langchain.prompts import PromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

template = """Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Explain the answer in 3 sentences at max. Be concise.
Always say "Done!" at the end of the answer. 
{context}
Question: {question}
Answer:"""

prompt = PromptTemplate.from_template(template)

qa_chain = (
    {"context": retriever, "question": RunnablePassthrough()} 
    | prompt 
    | llm 
)

qa_chain.invoke("What steps are used for the DDPO algorithm?").content

'The steps used for the DDPO algorithm are reframing the diffusion process as a multi-step Markov decision process (MDP), applying policy gradient algorithms such as REINFORCE or importance sampled estimator, and evaluating the performance of DDPO on different reward functions. Done!'
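
To see what the model receives, it can help to assemble a prompt by hand. The sketch below joins the text of a few chunks with blank lines and fills a shortened version of the template. The `Chunk` class, the `format_chunks` and `build_prompt` helpers, and the toy chunk strings are all hypothetical stand-ins; no LangChain or OpenAI call is made.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """Hypothetical stand-in for a retrieved document chunk."""
    page_content: str


# Shortened version of the template used above, for illustration only.
TEMPLATE = """Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
Answer:"""


def format_chunks(chunks: List[Chunk]) -> str:
    """Join chunk texts with blank lines so the model sees plain prose."""
    return "\n\n".join(chunk.page_content for chunk in chunks)


def build_prompt(chunks: List[Chunk], question: str) -> str:
    """Fill the template with the joined context and the user's question."""
    return TEMPLATE.format(context=format_chunks(chunks), question=question)


example_prompt = build_prompt(
    [
        Chunk("Each denoising step is an action."),
        Chunk("The reward arrives on the final step."),
    ],
    "What steps are used for the DDPO algorithm?",
)
print(example_prompt)
```

Printing the assembled prompt is a quick way to check that the retrieved context reads cleanly before it is sent to the model.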

In [8]:
qa_chain.invoke("What's the main idea discussed in the article?").content

'The main idea discussed in the article is the training of diffusion models with reinforcement learning and the phenomenon of unexpected generalization in these models. The article highlights the lack of a general-purpose method for preventing overoptimization and the need for future work in this area. It also mentions the surprising generalization that occurs when finetuning large language models and text-to-image diffusion models. Done!'