# Playing with OpenAI
A simple example of how to read text from a PDF file, split it into chunks, embed the chunks with OpenAI embeddings,
and store them in a Facebook AI Similarity Search (Faiss) index for question answering.


## Sources
https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/faiss.html
https://platform.openai.com/docs/api-reference/models


## Setup
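Clone the project, then run the following commands to create a virtual environment, install the dependencies, and create a `.env` file.
Afterwards, add your OpenAI API key after the `=` in `.env`.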
```console
python -m venv env
source env/bin/activate
pip install -r requirements.txt
echo 'OPENAI_API_KEY=' >.env
```



In [10]:
import os
from dotenv import load_dotenv
from pypdf import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI


In [2]:
# Load environment variables from the .env file in the root directory of the project
load_dotenv()

# Load the API key from the .env file
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if OPENAI_API_KEY is None:
    raise ValueError("Please set your OPENAI_API_KEY in the .env file")


In [3]:
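# Extract the text of every page in the PDF into a single string
reader = PdfReader("data/bitcoin.pdf")
number_of_pages = len(reader.pages)
raw_text = ""
for idx, page in enumerate(reader.pages):
    text = page.extract_text()
    # extract_text() returns an empty string for pages without extractable text
    if text:
        # Separate pages with a newline so words at page boundaries do not merge
        raw_text += text + "\n"
        if idx % 10 == 0 and idx > 0:
            print(f"Processed {idx} pages of {number_of_pages}")
    else:
        print(f"Page {idx} is empty")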

# Split the text into smaller chunks to avoid hitting the API/token limits
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)

texts = text_splitter.split_text(raw_text)
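
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of overlap-based chunking. The function `split_with_overlap` is hypothetical and only approximates the idea; it is not LangChain's implementation, and the toy input and small limits are chosen purely for illustration.

```python
def split_with_overlap(text, separator="\n", chunk_size=1000, chunk_overlap=200):
    """Hypothetical, simplified chunker (not LangChain's implementation)."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        if current and len(separator.join(current + [piece])) > chunk_size:
            # The current chunk is full: emit it
            chunks.append(separator.join(current))
            # Keep only trailing pieces that fit the overlap budget and
            # still leave room for the incoming piece
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# Toy input: ten 8-character lines, with small limits so the effect is visible
sample = "\n".join(f"line {i:03d}" for i in range(10))
chunks = split_with_overlap(sample, chunk_size=30, chunk_overlap=10)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk stays within `chunk_size`, and consecutive chunks share their boundary line, so a sentence that straddles a chunk boundary still appears whole in at least one chunk.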


In [None]:
# Embed text with an OpenAI embeddings model (not a chat/completion model)
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY, model="text-embedding-ada-002", openai_api_type="open_ai")

In [5]:
# Embed every chunk and store the vectors in an in-memory FAISS index
docsearch = FAISS.from_texts(texts, embeddings)
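
Conceptually, a similarity search over such an index compares the query vector with the stored vectors and returns the closest ones. The sketch below shows that idea as a brute-force search in plain Python. The helper names (`l2_distance`, `brute_force_search`) are hypothetical, and the 3-dimensional vectors and sample texts are hand-written toy values, not real OpenAI embeddings; Faiss itself uses optimized index structures.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_search(query_vec, index, k=2):
    """Return the k (text, distance) pairs closest to query_vec."""
    scored = [(text, l2_distance(query_vec, vec)) for text, vec in index]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy data: made-up vectors standing in for real embeddings
toy_index = [
    ("Bitcoin is a peer-to-peer electronic cash system.", [0.9, 0.1, 0.0]),
    ("Nodes collect transactions into blocks.", [0.6, 0.4, 0.1]),
    ("The weather was sunny.", [0.0, 0.1, 0.9]),
]
toy_query = [1.0, 0.0, 0.0]

results = brute_force_search(toy_query, toy_index, k=2)
for text, distance in results:
    print(f"{distance:.3f}  {text}")
```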

In [7]:
# chain_type="stuff" places all retrieved chunks into a single prompt for the LLM
chain = load_qa_chain(OpenAI(), chain_type="stuff")

In [9]:
# Retrieve the chunks most similar to the query, then let the LLM answer from them
query = "What is Bitcoin?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

Out[9]:
' Bitcoin is a peer-to-peer electronic cash system which allows online payments to be sent directly from one party to another without going through a financial institution. It is secured by a peer-to-peer network using proof-of-work to record a public history of transactions, making it computationally impractical for an attacker to change if honest nodes control a majority of computing power.'
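
The `"stuff"` chain type used above works, roughly, by concatenating the retrieved chunks into one context block and sending it to the model together with the question. A minimal sketch of that idea follows. The function `build_stuff_prompt` and its prompt wording are hypothetical and are not LangChain's actual template; the two sample chunks are placeholders.

```python
def build_stuff_prompt(doc_texts, question):
    """Hypothetical prompt builder: join all chunks into one context block."""
    context = "\n\n".join(doc_texts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Placeholder chunks standing in for the documents returned by the similarity search
prompt = build_stuff_prompt(
    ["Chunk one about proof-of-work.", "Chunk two about transactions."],
    "What is Bitcoin?",
)
print(prompt)
```

Because every retrieved chunk ends up in the same prompt, this approach only works while the combined text fits within the model's context window, which is one reason the text is split into bounded chunks first.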