## Overview
This notebook is adapted from the video [Master PDF Chat with LangChain](https://www.youtube.com/watch?v=ZzgUqFtxgXI).

### Make sure all libraries are installed

In [2]:
# LangChain template for building an AI assistant that reads PDFs and answers questions about them
!pip -q install langchain openai tiktoken PyPDF2 faiss-cpu

### Load OpenAI API Key

In [3]:
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

In [4]:
# Test the API key

openai.Completion.create(engine="davinci", prompt="This is a test", max_tokens=5)

<OpenAIObject text_completion id=cmpl-7BUHrz9Nco10dGfWrWGlpN24JCmVO at 0x10eb289f0> JSON: {
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "text": " of a daemon that runs"
    }
  ],
  "created": 1682972387,
  "id": "cmpl-7BUHrz9Nco10dGfWrWGlpN24JCmVO",
  "model": "davinci",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 5,
    "prompt_tokens": 4,
    "total_tokens": 9
  }
}

In [5]:
!pip show langchain

Name: langchain
Version: 0.0.130
Summary: Building applications with LLMs through composability
Home-page: https://www.github.com/hwchase17/langchain
Author: 
Author-email: 
License: MIT
Location: /opt/homebrew/Caskroom/miniforge/base/envs/huggingface/lib/python3.9/site-packages
Requires: aiohttp, dataclasses-json, numpy, pydantic, PyYAML, requests, SQLAlchemy, tenacity
Required-by: 


### Architecture 


<img src="https://dl.dropboxusercontent.com/s/gxij5593tyzrvsg/Screenshot%202023-04-26%20at%203.06.50%20PM.png" alt="vectorstore">


<img src="https://dl.dropboxusercontent.com/s/v1yfuem0i60bd88/Screenshot%202023-04-26%20at%203.52.12%20PM.png" alt="retriever chain">


In [1]:
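# Download Reid Hoffman's book "Impromptu" (written with GPT-4) from its free download link
# into the content/ directory, which is where the PdfReader cell looks for it

!wget -q -P content https://www.impromptubook.com/wp-content/uploads/2023/03/impromptu-rh.pdf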

### Import packages


In [2]:
from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS 

### Read in the PDF


In [5]:
# location of the pdf file/files. 
doc_reader = PdfReader('content/impromptu-rh.pdf')

In [7]:
# show the number of pages in the pdf file
print("Number of pages in the pdf file: ", len(doc_reader.pages))

Number of pages in the pdf file:  230


In [14]:
# show the text at index 20 (the 21st page, since indexing is zero-based)
doc_reader.pages[20].extract_text()

'14Impromptu: Amplifying Our Humanity Through AI\nI assume this because, along with the speech’s first sentence, \nthe ninth is frequently quoted in other texts. That means in \nChatGPT’s training data, the ninth probably shows up more \noften than other sentences from the speech (except the very \nfamous first). This prevalence is what causes ChatGPT to reach \nfor it when you ask it to supply the fifth sentence.3\nTo ChatGPT’s credit, though, if you ask it to turn the text of the \nGettysburg Address into lyrics for a Rush song, and then tell \nyou who’d be singing it if Rush performed it, it will pass that \ntest with flying colors. \nTry it out and see what I mean.\nEmbracing the “AHA!” moment\nAs AI tools like GPT-4 become more powerful, they are inten-\nsifying long-standing concerns about AIs and robots margin -\nalizing and even eliminating a sweeping range of human jobs: \neverything from customer-service reps to attorneys. \nSuch concerns won’t seem baseless if you’ve followe

In [15]:
# concatenate the text of every page into raw_text, skipping pages with no extractable text
raw_text = ''
for page in doc_reader.pages:
    text = page.extract_text()
    if text:
        raw_text += text

In [16]:
len(raw_text)

356630

In [17]:
raw_text[:100]

'Impromptu\nAmplifying Our Humanity \nThrough AI\nBy Reid Hoffman  \nwith GPT-4Impromptu: AmplIfyIng our '

### Text Splitter

This splits the text into overlapping chunks. Note that `chunk_size` and `chunk_overlap` are measured in characters, not tokens, because `length_function` is `len`.

In [18]:
# Splitting up the text into smaller chunks for indexing
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,      # maximum chunk length, in characters
    chunk_overlap=200,    # consecutive chunks share up to 200 characters
    length_function=len,
)
texts = text_splitter.split_text(raw_text)

In [19]:
len(texts)

448
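To make the character-based chunking concrete, here is a minimal pure-Python sketch of splitting on a separator and merging pieces into overlapping chunks. It is an illustration only, not LangChain's implementation; the function name `split_with_overlap` and the toy input are assumptions made for this example.

```python
def split_with_overlap(text, separator="\n", chunk_size=1000, chunk_overlap=200):
    """Hypothetical helper: merge separator-delimited pieces into chunks of at most
    chunk_size characters, carrying up to chunk_overlap characters into the next chunk.
    A single piece longer than chunk_size is kept whole."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current, current_len = [], [], 0
    for piece in pieces:
        added = len(piece) + (len(separator) if current else 0)
        if current and current_len + added > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until only the overlap remains
            # and the new piece fits.
            while current and (
                current_len > chunk_overlap
                or current_len + len(separator) + len(piece) > chunk_size
            ):
                removed = current.pop(0)
                current_len -= len(removed) + (len(separator) if current else 0)
            added = len(piece) + (len(separator) if current else 0)
        current.append(piece)
        current_len += added
    if current:
        chunks.append(separator.join(current))
    return chunks


# Toy input: 50 short lines, with small sizes so the overlap is easy to see.
lines = [f"line {i:03d}" for i in range(50)]
chunks = split_with_overlap("\n".join(lines), chunk_size=40, chunk_overlap=10)
print(repr(chunks[0]))
print(repr(chunks[1]))
```

Lengths here are counted with `len`, so a 1000-character chunk will usually contain far fewer than 1000 tokens.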

In [20]:
texts[20]

'million registered users. \nIn late January 2023, Microsoft1—which had invested $1 billion \nin OpenAI in 2019—announced that it would be investing $10 \nbillion more in the company. It soon unveiled a new version of \nits search engine Bing, with a variation of ChatGPT built into it.\n1 I sit on Microsoft’s Board of Directors. 10Impromptu: Amplifying Our Humanity Through AI\nBy the start of February 2023, OpenAI said ChatGPT had \none hundred million monthly active users, making it the fast-\nest-growing consumer internet app ever. Along with that \ntorrent of user interest, there were news stories of the new Bing \nchatbot functioning in sporadically unusual ways that were \nvery different from how ChatGPT had generally been engaging \nwith users—including showing “anger,” hurling insults, boast-\ning on its hacking abilities and capacity for revenge, and basi-\ncally acting as if it were auditioning for a future episode of Real \nHousewives: Black Mirror Edition .'

In [21]:
texts[10]

'to serve? What is the role of the restaurant inspector in \nthis context? Is the inspector responsible for installing \nthe lightbulb, or is their job limited to inspecting it? The \nanswers to these questions will shape the answer to the \noriginal question. Without these answers, the question \ncan only be answered in the abstract and is ultimately \nunanswerable. Language, not mathematics, is the key to \nunlocking the answer.\nOkay, less funny than the Seinfeld one, but still—impressive!\nEven from these brief performances, it seemed clear to me that \nGPT-4 had reached a new level of proficiency compared to its \npredecessors. And the more I interacted with GPT-4, the more \nI felt this way.\nAlong with writing better lightbulb jokes, GPT-4 was also \nskilled at generating prose of all kinds, including emails, poetry, \nessays, and more. It was great at summarizing documents. It \nhad gotten better at translating languages and writing com-\nputer code, to name just some of its po

## Make the embeddings 

In [22]:
# Set up the OpenAI embeddings client (nothing is embedded until the index is built)

embeddings = OpenAIEmbeddings()

In [23]:
docsearch = FAISS.from_texts(texts, embeddings)

In [24]:
docsearch.embedding_function

<bound method OpenAIEmbeddings.embed_query of OpenAIEmbeddings(client=<class 'openai.api_resources.embedding.Embedding'>, model='text-embedding-ada-002', document_model_name='text-embedding-ada-002', query_model_name='text-embedding-ada-002', embedding_ctx_length=-1, openai_api_key=None, chunk_size=1000, max_retries=6)>

In [25]:
query = "how does GPT-4 change social media?"
docs = docsearch.similarity_search(query)

In [26]:
len(docs)

4

In [27]:
docs[0]

Document(page_content='rected ways that tools like GPT-4 and DALL-E 2 enable.\nThis is a theme I’ve touched on throughout this travelog, but \nit’s especially relevant in this chapter. From its inception, social \nmedia worked to recast broadcast media’s monolithic and \npassive audiences as interactive, democratic communities, in \nwhich newly empowered participants could connect directly \nwith each other. They could project their own voices broadly, \nwith no editorial “gatekeeping” beyond a given platform’s terms \nof service.\nEven with the rise of recommendation algorithms, social media \nremains a medium where users have more chance to deter -\nmine their own pathways and experiences than they do in the \nworld of traditional media. It’s a medium where they’ve come \nto expect a certain level of autonomy, and typically they look for \nnew ways to expand it.\nSocial media content creators also wear a lot of hats, especially \nwhen starting out. A new YouTube creator is probably n
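Conceptually, `similarity_search` embeds the query and returns the chunks whose vectors lie closest to it (four by default, as the output above shows). The sketch below illustrates that idea with a brute-force search over made-up 3-dimensional vectors. The vectors, the chunk labels, and the use of Euclidean distance are assumptions for illustration; real embeddings have many more dimensions, and FAISS is far more efficient.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity_search(query_vec, indexed, k=4):
    """Hypothetical helper: return the k texts whose vectors are closest to query_vec."""
    ranked = sorted(indexed, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Toy index: (text, vector) pairs with invented vectors.
index = [
    ("social media chunk", [0.9, 0.1, 0.0]),
    ("journalism chunk", [0.1, 0.9, 0.0]),
    ("education chunk", [0.0, 0.1, 0.9]),
]

top = similarity_search([0.8, 0.2, 0.0], index, k=2)
print(top)
```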

### Plain QA Chain

In [28]:
# Create a plain QA chain

from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

In [29]:
# Stuff in all the docs at once

chain = load_qa_chain(OpenAI(), 
                      chain_type="stuff") # we are going to stuff all the docs in at once

In [30]:
# check the prompt

chain.llm_chain.prompt.template

"Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:
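The "stuff" strategy can be pictured as plain string formatting: all retrieved chunks are joined and substituted for `{context}` in the template shown above. The sketch below uses that same template; the helper name `build_stuff_prompt`, the placeholder chunks, and the blank-line separator between chunks are assumptions for illustration.

```python
STUFF_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\nQuestion: {question}\nHelpful Answer:"
)


def build_stuff_prompt(chunks, question, separator="\n\n"):
    """Hypothetical helper: put every chunk into one prompt (the 'stuff' idea)."""
    context = separator.join(chunks)
    return STUFF_TEMPLATE.format(context=context, question=question)


prompt_text = build_stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "who are the authors of the book?",
)
print(prompt_text)
```

Because every chunk lands in a single prompt, the total length grows with the number of retrieved chunks and has to stay within the model's context window.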

In [31]:
query = "who are the authors of the book?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

' The authors of the book are Reid Hoffman and Ben Casnocha.'

In [32]:
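# Retrieve with an unrelated query (query_02) so the context does not contain the answer;
# the chain should then say it doesn't know rather than make something up
query = "who is the author of the book?"
query_02 = "has it rained this week?"
docs = docsearch.similarity_search(query_02)
chain.run(input_documents=docs, question=query)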

" I don't know."

In [33]:
query = "who is the book authored by?"
docs = docsearch.similarity_search(query,k=4)
chain.run(input_documents=docs, question=query)

' The book is authored by di Cesare and Reid Hoffman.'

### QA Chain with Map Rerank

In [40]:
chain = load_qa_chain(OpenAI(), 
                      chain_type="stuff") # we are going to stuff all the docs in at once

In [41]:
query = "who is the book authored by?"
docs = docsearch.similarity_search(query,k=10) # retrieve 10 chunks; with "stuff" they must all fit in the token limit
chain.run(input_documents=docs, question=query)

' The book is authored by Reid Hoffman and GPT-4.'

In [43]:
chain = load_qa_chain(OpenAI(), 
                      chain_type="map_rerank",
                      return_intermediate_steps=True
                      ) 

query = "who are openai?"
docs = docsearch.similarity_search(query,k=5)
results = chain({"input_documents": docs, "question": query}, return_only_outputs=True)
results

{'intermediate_steps': [{'answer': ' OpenAI is an organization founded with the goal of developing technologies that put the power of AI directly into the hands of millions of people.',
   'score': '100'},
  {'answer': ' OpenAI is an AI research and deployment company whose mission is to give millions of users hands-on access to AI tools.',
   'score': '90'},
  {'answer': ' OpenAI is a research laboratory focused on developing artificial general intelligence (AGI) founded by Elon Musk, Sam Altman, Greg Brockman, and others. ',
   'score': '80'},
  {'answer': ' OpenAI is a research organization that develops and shares artificial intelligence tools for the benefit of humanity.',
   'score': '100'},
  {'answer': ' OpenAI is a technology company that develops artificial intelligence tools.',
   'score': '80'}],
 'output_text': ' OpenAI is an organization founded with the goal of developing technologies that put the power of AI directly into the hands of millions of people.'}

In [44]:
results['output_text']

' OpenAI is an organization founded with the goal of developing technologies that put the power of AI directly into the hands of millions of people.'

In [45]:
results['intermediate_steps']

[{'answer': ' OpenAI is an organization founded with the goal of developing technologies that put the power of AI directly into the hands of millions of people.',
  'score': '100'},
 {'answer': ' OpenAI is an AI research and deployment company whose mission is to give millions of users hands-on access to AI tools.',
  'score': '90'},
 {'answer': ' OpenAI is a research laboratory focused on developing artificial general intelligence (AGI) founded by Elon Musk, Sam Altman, Greg Brockman, and others. ',
  'score': '80'},
 {'answer': ' OpenAI is a research organization that develops and shares artificial intelligence tools for the benefit of humanity.',
  'score': '100'},
 {'answer': ' OpenAI is a technology company that develops artificial intelligence tools.',
  'score': '80'}]
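The final step of `map_rerank` can be sketched as picking the highest-scoring intermediate answer. The snippet below replays the intermediate steps printed above; treating ties as "first one wins" is an assumption, although it matches the `output_text` shown here.

```python
intermediate_steps = [
    {"answer": " OpenAI is an organization founded with the goal of developing technologies that put the power of AI directly into the hands of millions of people.",
     "score": "100"},
    {"answer": " OpenAI is an AI research and deployment company whose mission is to give millions of users hands-on access to AI tools.",
     "score": "90"},
    {"answer": " OpenAI is a research laboratory focused on developing artificial general intelligence (AGI) founded by Elon Musk, Sam Altman, Greg Brockman, and others. ",
     "score": "80"},
    {"answer": " OpenAI is a research organization that develops and shares artificial intelligence tools for the benefit of humanity.",
     "score": "100"},
    {"answer": " OpenAI is a technology company that develops artificial intelligence tools.",
     "score": "80"},
]

# Scores come back as strings, so convert before comparing.
# max() returns the first item among equal scores.
best = max(intermediate_steps, key=lambda step: int(step["score"]))
print(best["answer"])
```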

In [46]:
# check the prompt
chain.llm_chain.prompt.template

"Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\nIn addition to giving an answer, also return a score of how fully it answered the user's question. This should be in the following format:\n\nQuestion: [question here]\nHelpful Answer: [answer here]\nScore: [score between 0 and 100]\n\nHow to determine the score:\n- Higher is a better answer\n- Better responds fully to the asked question, with sufficient level of detail\n- If you do not know the answer based on the context, that should be a score of 0\n- Don't be overconfident!\n\nExample #1\n\nContext:\n---------\nApples are red\n---------\nQuestion: what color are apples?\nHelpful Answer: red\nScore: 100\n\nExample #2\n\nContext:\n---------\nit was night and the witness forgot his glasses. he was not sure if it was a sports car or an suv\n---------\nQuestion: what type was the car?\nHelpful Answer: a sports car or an su

### RetrievalQA
The RetrievalQA chain wraps `load_qa_chain` and combines it with a retriever (in our case, the FAISS index).

In [None]:
from langchain.chains import RetrievalQA

# set up FAISS as a generic retriever 
retriever = docsearch.as_retriever(search_type="similarity", search_kwargs={"k":4})

# create the chain to answer questions 
rqa = RetrievalQA.from_chain_type(llm=OpenAI(), 
                                  chain_type="stuff", 
                                  retriever=retriever, 
                                  return_source_documents=True)
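The retrieve-then-answer flow can be sketched without any external services. In the snippet below, `keyword_retriever` and `fake_llm` are hypothetical stand-ins for the FAISS retriever and `OpenAI()`, and the toy chunks are invented for illustration. The output keys mirror the `result` key used in this notebook and the `return_source_documents=True` option.

```python
import re


def words(text):
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def keyword_retriever(question, chunks, k=4):
    """Hypothetical stand-in for the FAISS retriever: rank chunks by shared words."""
    q_words = words(question)
    ranked = sorted(chunks, key=lambda c: len(q_words & words(c)), reverse=True)
    return ranked[:k]


def fake_llm(prompt):
    """Hypothetical stand-in for the LLM: reports the prompt size instead of answering."""
    return f"(model reply to a {len(prompt)}-character prompt)"


def retrieval_qa(question, chunks, k=4):
    """Retrieve the top-k chunks, stuff them into one prompt, and ask the model."""
    docs = keyword_retriever(question, chunks, k=k)
    prompt = "\n\n".join(docs) + f"\n\nQuestion: {question}\nHelpful Answer:"
    return {"query": question, "result": fake_llm(prompt), "source_documents": docs}


toy_chunks = [
    "OpenAI develops GPT-4.",
    "Social media has many creators.",
    "Journalists can use GPT-4 for research.",
]
out = retrieval_qa("What is OpenAI?", toy_chunks, k=2)
print(out["result"])
print(out["source_documents"])
```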

In [None]:
rqa("What is OpenAI?")

In [None]:
query = "What does gpt-4 mean for creativity?"
rqa(query)['result']

In [None]:
query = "what have the last 20 years been like for American journalism?"
rqa(query)['result']

In [None]:
query = "how can journalists use GPT-4??"
rqa(query)['result']

In [None]:
query = "How is GPT-4 different from other models?"
rqa(query)['result']

In [None]:
query = "What is beagle Bard?"
rqa(query)['result']

In [None]:
from langchain import PromptTemplate, HuggingFaceHub, LLMChain

# initialize HF LLM
flan_t5 = HuggingFaceHub(
    repo_id="google/flan-t5-xl",
    model_kwargs={"temperature": 0},  # alternative value: 1e-10
)

In [None]:
# build prompt template for simple question-answering
template = """Question: {question}

Answer: """
prompt = PromptTemplate(template=template, input_variables=["question"])

### Setting up OpenAI GPT-3.5

In [None]:
from langchain.llms import OpenAI, OpenAIChat

In [None]:

llm = OpenAIChat(model_name='gpt-3.5-turbo', 
             temperature=0.9, 
             max_tokens = 256,
             )

In [None]:
import openai

# openai.ChatCompletion

In [None]:
text = "Why did the chicken cross the road?"

print(llm(text))

## Cohere 

In [None]:
from langchain.llms import Cohere

In [None]:
llm = Cohere(model='command-xlarge-nightly', 
             temperature=0.9, 
             max_tokens = 256)

In [None]:
text = "Why did the chicken cross the road?"

print(llm(text))

## PromptTemplates

In [None]:
from langchain import PromptTemplate


template = """
I want you to act as a naming consultant for new companies.

Here are some examples of good company names:

- search engine, Google
- social media, Facebook
- video sharing, YouTube

The name should be short, catchy and easy to remember.

What is a good name for a company that makes {product}?
"""


In [None]:
prompt = PromptTemplate(
    input_variables=["product"],
    template=template,
)

In [None]:
prompt.format(product="colorful socks")
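For a simple template like this, `prompt.format(...)` behaves much like Python's built-in `str.format`: each declared input variable is substituted into its `{placeholder}`. The sketch below shows the equivalent plain-Python call on a one-line template. It assumes the default f-string-style template format and is not a replacement for `PromptTemplate`, which also validates the input variables.

```python
# One-line template with a single placeholder.
name_template = "What is a good name for a company that makes {product}?"

# Substituting the variable yields the final prompt string sent to the model.
filled = name_template.format(product="colorful socks")
print(filled)

# Leaving out a required variable raises KeyError in plain str.format.
try:
    name_template.format()
    missing_variable_error = None
except KeyError as err:
    missing_variable_error = err
print(repr(missing_variable_error))
```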

In [None]:
from langchain.chains import LLMChain
chain = LLMChain(llm=llm, prompt=prompt)

In [None]:
response = chain.run("Rabbit houses")
response

## Jasmine prompt

In [None]:
template = '''I want you to play the role of Jasmine, a programmer at Red Dragon AI. She is 28. She codes models in PyTorch. She has a male cat called Pixel. She loves pizza.

Engage actively in a chat playing the role of Jasmine and learn as much about the human as possible. Only generate a single response from Jasmine and never from the human.

{human_chat}
'''

In [None]:
prompt = PromptTemplate(
    input_variables=["human_chat"],
    template=template,
)

In [None]:
from langchain.chains import LLMChain
chain = LLMChain(llm=llm, prompt=prompt)

In [None]:
response = chain.run("Tell me about yourself?")
response

In [None]:
def talk_to_Jasmine(text_input):
    prompt = PromptTemplate(
        input_variables=["human_chat"],
        template=template,
    )
    chain = LLMChain(llm=llm, prompt=prompt)
    response = chain.run(text_input)
    return response

In [None]:
talk_to_Jasmine('Tell me about your cat')

In [None]:
# from langchain.prompts import PromptTemplate
# from langchain.llms import OpenAI

# llm = OpenAI(temperature=0.9)
# prompt = PromptTemplate(
#     input_variables=["product"],
#     template="What is a good name for a company that makes {product}?",
# )