## OpenAI vs. Local Embeddings
Performance Comparison
- OpenAI's Embedding Model
- InstructorEmbedding (https://huggingface.co/hkunlp/instructor-xl)

In [None]:
!pip -q install langchain openai tiktoken chromadb pypdf sentence_transformers InstructorEmbedding faiss-cpu

In [None]:
import os
os.environ["OPENAI_API_KEY"] = "YOUR-API-KEY"

In [None]:
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import DirectoryLoader

In [None]:
# InstructorEmbedding
from InstructorEmbedding import INSTRUCTOR
from langchain.embeddings import HuggingFaceInstructEmbeddings



In [None]:
# OpenAI Embedding
from langchain.embeddings import OpenAIEmbeddings

### Load Multiple files from Directory

In [None]:
# connect your Google Drive
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive"

Mounted at /content/gdrive


In [None]:
# loader = TextLoader('single_text_file.txt')
loader = DirectoryLoader(f'{root_dir}/Documents/', glob="./*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()

In [None]:
# documents

### Divide and Conquer

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
                                               chunk_size=1000,
                                               chunk_overlap=200)

texts = text_splitter.split_documents(documents)
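To make the `chunk_size` and `chunk_overlap` settings concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and newlines). The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that share an overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Small numbers so the overlap is easy to see
sample = "abcdefghij" * 3  # 30 characters
chunks = chunk_with_overlap(sample, chunk_size=12, chunk_overlap=4)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.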

In [None]:
texts[0]

Document(page_content='GPT4All: Training an Assistant-style Chatbot with Large Scale Data\nDistillation from GPT-3.5-Turbo\nYuvanesh Anand\nyuvanesh@nomic.aiZach Nussbaum\nzanussbaum@gmail.com\nBrandon Duderstadt\nbrandon@nomic.aiBenjamin Schmidt\nben@nomic.aiAndriy Mulyar\nandriy@nomic.ai\nAbstract\nThis preliminary technical report describes the\ndevelopment of GPT4All, a chatbot trained\nover a massive curated corpus of assistant in-\nteractions including word problems, story de-\nscriptions, multi-turn dialogue, and code. We\nopenly release the collected data, data cura-\ntion procedure, training code, and final model\nweights to promote open research and repro-\nducibility. Additionally, we release quantized\n4-bit versions of the model allowing virtually\nanyone to run the model on CPU.\n1 Data Collection and Curation\nWe collected roughly one million prompt-\nresponse pairs using the GPT-3.5-Turbo OpenAI\nAPI between March 20, 2023 and March 26th,\n2023. To do this, we first gat

In [None]:
len(texts)

22

### Get Embeddings for OUR Documents

In [None]:
# !pip install faiss-cpu

In [None]:
import pickle
import faiss
from langchain.vectorstores import FAISS

In [None]:
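def store_embeddings(docs, embeddings, sotre_name, path):
    """Embed `docs`, build a FAISS vector store, and pickle it under `path`."""
    # Build the vector store from the document chunks
    vector_store = FAISS.from_documents(docs, embeddings)

    # Create the target folder if needed, then persist the store to disk
    os.makedirs(path, exist_ok=True)
    with open(f"{path}/faiss_{sotre_name}.pkl", "wb") as f:
        pickle.dump(vector_store, f)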

In [None]:
def load_embeddings(sotre_name, path):
    """Load a pickled FAISS vector store written by store_embeddings."""
    with open(f"{path}/faiss_{sotre_name}.pkl", "rb") as f:
        vector_store = pickle.load(f)
    return vector_store
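The two helpers above suggest a "build once, reuse later" pattern. A minimal sketch of that pattern follows, using a plain dictionary as a stand-in for the vector store so it runs without any model. The names `load_or_build` and `fake_build` are hypothetical, and whether a real FAISS store pickles cleanly depends on the installed library versions.

```python
import os
import pickle
import tempfile


def load_or_build(path, build_fn):
    """Return the pickled object at `path`; build and save it first if absent."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)  # only unpickle files you created yourself
    obj = build_fn()
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj


calls = []


def fake_build():
    """Stand-in for the expensive embedding step; records each invocation."""
    calls.append(1)
    return {"vectors": [[0.1, 0.2], [0.3, 0.4]]}


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "store", "faiss_demo.pkl")
    first = load_or_build(cache_path, fake_build)   # builds and saves
    second = load_or_build(cache_path, fake_build)  # loads from disk

print(len(calls))
```

The second call reads the file instead of rebuilding, which is the saving that matters when the build step is an embedding run over every chunk.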

### HF Instructor Embeddings

In [None]:
from langchain.embeddings import HuggingFaceInstructEmbeddings

instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl",
                                                      model_kwargs={"device": "cuda"})

load INSTRUCTOR_Transformer
max_seq_length  512


In [None]:
Embedding_store_path = f"{root_dir}/Embedding_store"

In [None]:
# store_embeddings(texts,
#                  instructor_embeddings,
#                  sotre_name='instructEmbeddings',
#                  path=Embedding_store_path)

In [None]:
# db_instructEmbedd = load_embeddings(sotre_name='instructEmbeddings',
#                                     path=Embedding_store_path)

In [None]:
db_instructEmbedd = FAISS.from_documents(texts, instructor_embeddings)

In [None]:
retriever = db_instructEmbedd.as_retriever(search_kwargs={"k": 3})

In [None]:
retriever.search_type

'similarity'

In [None]:
retriever.search_kwargs

{'k': 3}
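Conceptually, a `'similarity'` search with `k = 3` returns the three stored vectors closest to the query vector. The sketch below shows that idea with an exhaustive Euclidean-distance search over toy 2-D vectors. This is an assumption for illustration: the distance metric and index type actually used by the store above are not shown in this notebook, and `top_k` is a hypothetical helper.

```python
import math


def top_k(query_vec, doc_vecs, k=3):
    """Return indices of the k vectors closest to query_vec (Euclidean distance)."""
    scored = [(math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort()  # smallest distance first
    return [idx for _, idx in scored[:k]]


# Toy 2-D "embeddings"; real embeddings have hundreds of dimensions
doc_vecs = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
nearest = top_k([0.9, 0.1], doc_vecs, k=3)
print(nearest)
```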

In [None]:
docs = retriever.get_relevant_documents("Who are the authors of GPT4All report?")

In [None]:
docs[0]

Document(page_content='We accompany this paper with the 800k point\nGPT4All-J dataset that is a superset of the origi-\nnal 400k points GPT4All dataset. We dedicated\nsubstantial attention to data preparation and cura-\ntion.Building on the GPT4All dataset, we curated\nthe GPT4All-J dataset by augmenting the origi-\nnal 400k GPT4All examples with new samples\nencompassing additional multi-turn QA samples\nand creative writing such as poetry, rap, and short\nstories. We designed prompt templates to create\ndifferent scenarios for creative writing. The cre-\native prompt template was inspired by Mad Libs\nstyle variations of ‘Write a [creative story type]\nabout [NOUN] in the style of [PERSON]‘. In ear-\nlier versions of GPT4All, we found that rather than\nwriting actual creative content, the model would\ndiscuss how it would go about writing the content.\nTraining on this new dataset allows GPT4All-J to\nwrite poems, songs, and plays with increased com-\npetence.\nWe used Atlas to infor

In [None]:
# create the chain to answer questions
qa_chain_instrucEmbed = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0.2),
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True)
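With `chain_type="stuff"`, all retrieved chunks are placed ("stuffed") into a single prompt together with the question. A rough sketch of that idea is below. The prompt wording and the helper name `build_stuff_prompt` are hypothetical and do not reproduce LangChain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate every retrieved chunk into one prompt ahead of the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_stuff_prompt(
    "Who wrote the report?",
    ["chunk A", "chunk B", "chunk C"],  # stand-ins for the k=3 retrieved chunks
)
print(prompt)
```

Because every chunk goes into one prompt, `k` times the chunk size has to fit within the model's context window alongside the question and the answer.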

### OpenAI's Embeddings

In [None]:
from langchain.embeddings import OpenAIEmbeddings

In [None]:
embeddings = OpenAIEmbeddings()

In [None]:
# store_embeddings(texts,
#                  embeddings,
#                  sotre_name='openAIEmbeddings',
#                  path=Embedding_store_path)

In [None]:
# db_openAIEmbedd = load_embeddings(sotre_name='openAIEmbeddings',
#                                     path=Embedding_store_path)

In [None]:
db_openAIEmbedd = FAISS.from_documents(texts, embeddings)
retriever_openai = db_openAIEmbedd.as_retriever(search_kwargs={"k": 3})

In [None]:
# create the chain to answer questions
qa_chain_openai = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0.2),
    chain_type="stuff",
    retriever=retriever_openai,
    return_source_documents=True)

### Testing both MODELS

In [None]:
## Cite sources

import textwrap

def wrap_text_preserve_newlines(text, width=110):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text

def process_llm_response(llm_response):
    print(wrap_text_preserve_newlines(llm_response['result']))
    print('\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])
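Since the retriever returns `k = 3` chunks and several of them can come from the same PDF, the printed source list may repeat a path. An optional sketch for collapsing repeats while keeping first-seen order is below. `unique_sources` is a hypothetical helper and the paths are made up.

```python
def unique_sources(source_paths):
    """Drop repeated paths while preserving the order of first appearance."""
    return list(dict.fromkeys(source_paths))


# Hypothetical paths standing in for source.metadata['source'] values
paths = ["docs/report_j.pdf", "docs/report_j.pdf", "docs/report.pdf"]
print(unique_sources(paths))
```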

In [None]:
query = 'who are the authors of GPT4all technical report?'

print('-------------------Instructor Embeddings------------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)

-------------------Instructor Embeddings------------------

 The authors of the GPT4all technical report are Yuvanesh Anand, Zach Nussbaum, Brandon Duderstadt, Benjamin
M. Schmidt, Adam Treat, and Andriy Mulyar.

Sources:
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf


In [None]:
query = 'who are the authors of GPT4all technical report?'

print('-------------------OpenAI Embeddings------------------')
llm_response = qa_chain_openai(query)
process_llm_response(llm_response)
print('\n\n\n')

-------------------OpenAI Embeddings------------------
 Yuvanesh Anand, Zach Nussbaum, Brandon Duderstadt, Benjamin M. Schmidt, Adam Treat, and Andriy Mulyar.

Sources:
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf






In [None]:
query = 'How was the GPT4All-J model trained?'

print('-------------------Instructor Embeddings------------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)

-------------------Instructor Embeddings------------------

 The GPT4All-J model was trained over a massive curated corpus of assistant interactions including word
problems, multi-turn dialogue, code, poems, songs, and stories.

Sources:
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf


In [None]:
query = 'How was the GPT4All-J model trained?'

print('-------------------OpenAI Embeddings------------------')
llm_response = qa_chain_openai(query)
process_llm_response(llm_response)
print('\n\n\n')

-------------------OpenAI Embeddings------------------
 The GPT4All-J model was trained with LoRA (Hu et al., 2021) on the 437,605 post-processed examples for four
epochs.

Sources:
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf






In [None]:
query = '"What was the cost of training the GPT4all model?"'

print('-------------------Instructor Embeddings------------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)

-------------------Instructor Embeddings------------------

 The cost of training the GPT4all model was about $800 in OpenAI API credits and $100 for a Lambda Labs DGX
A100 8x 80GB.

Sources:
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All_Technical_Report.pdf


In [None]:
query = '"What was the cost of training the GPT4all model?"'

print('-------------------OpenAI Embeddings------------------')
llm_response = qa_chain_openai(query)
process_llm_response(llm_response)
print('\n\n\n')

-------------------OpenAI Embeddings------------------
 $200

Sources:
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All_Technical_Report.pdf






In [None]:
query = "what license is GPT4All-J using?"

# print('-------------------OpenAI Embeddings------------------')
# llm_response = qa_chain_openai(query)
# process_llm_response(llm_response)
# print('\n\n\n')
print('-------------------Instructor Embeddings------------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)

-------------------Instructor Embeddings------------------

 GPT4All-J is using an Apache 2 license.

Sources:
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All_Technical_Report.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf


In [None]:
query = "what license is GPT4All-J using?"

print('-------------------OpenAI Embeddings------------------')
llm_response = qa_chain_openai(query)
process_llm_response(llm_response)
print('\n\n\n')

-------------------OpenAI Embeddings------------------
 GPT4All-J is using an Apache 2 license.

Sources:
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf






In [None]:
query = "what was the size of training dataset used for training GPT4All?"

# print('-------------------OpenAI Embeddings------------------')
# llm_response = qa_chain_openai(query)
# process_llm_response(llm_response)
# print('\n\n\n')
print('-------------------Instructor Embeddings------------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)

-------------------Instructor Embeddings------------------

 Roughly one million prompt-response pairs.

Sources:
/content/gdrive/My Drive/Documents/2023_GPT4All_Technical_Report.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf


In [None]:
query = "what was the size of training dataset used for training GPT4All?"

print('-------------------OpenAI Embeddings------------------')
llm_response = qa_chain_openai(query)
process_llm_response(llm_response)
print('\n\n\n')

-------------------OpenAI Embeddings------------------
 The final training dataset used for training GPT4All was 437,605 post-processed examples.

Sources:
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All_Technical_Report.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf






In [None]:
query = "what was the size of training dataset used for training GPT4All-J?"

# print('-------------------OpenAI Embeddings------------------')
# llm_response = qa_chain_openai(query)
# process_llm_response(llm_response)
# print('\n\n\n')
print('-------------------Instructor Embeddings------------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)

-------------------Instructor Embeddings------------------

 The training dataset used for training GPT4All-J was 437,605 post-processed examples.

Sources:
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf


In [None]:
query = "what was the size of training dataset used for training GPT4All-J?"

print('-------------------OpenAI Embeddings------------------')
llm_response = qa_chain_openai(query)
process_llm_response(llm_response)
print('\n\n\n')

-------------------OpenAI Embeddings------------------
 The final training dataset used for training GPT4All-J was 437,605 post-processed examples.

Sources:
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All_Technical_Report.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf






In [None]:
query = "what license is GPT4All using?"

# print('-------------------OpenAI Embeddings------------------')
# llm_response = qa_chain_openai(query)
# process_llm_response(llm_response)
# print('\n\n\n')
print('-------------------Instructor Embeddings------------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)

-------------------Instructor Embeddings------------------

 GPT4All is using an Apache 2 license.

Sources:
/content/gdrive/My Drive/Documents/2023_GPT4All_Technical_Report.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf


In [None]:
query = "what license is GPT4All using?"

print('-------------------OpenAI Embeddings------------------')
llm_response = qa_chain_openai(query)
process_llm_response(llm_response)
print('\n\n\n')

In [None]:
query = "Which MPT-7B model is the best?"

# print('-------------------OpenAI Embeddings------------------')
# llm_response = qa_chain_openai(query)
# process_llm_response(llm_response)
# print('\n\n\n')
print('-------------------Instructor Embeddings------------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)

-------------------Instructor Embeddings------------------

 According to the evaluation data from the Self-Instruct paper (Wang et al., 2022), the best openly available
alpaca-lora model provided by user chainyo on huggingface has the lowest ground truth perplexity.

Sources:
/content/gdrive/My Drive/Documents/2023_GPT4All_Technical_Report.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf


In [None]:
query = "Which MPT-7B model is the best?"

print('-------------------OpenAI Embeddings------------------')
llm_response = qa_chain_openai(query)
process_llm_response(llm_response)
print('\n\n\n')

-------------------OpenAI Embeddings------------------
 Alpaca Lora 7B is the best MPT-7B model according to the evaluation data from the Self-Instruct paper (Wang
et al., 2022).

Sources:
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
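The paired cells above repeat the same steps for each query. One possible way to run the comparison in a single loop is sketched below. It uses stub chains so that it runs without API calls. `compare_chains` and `make_stub_chain` are hypothetical names, and the stubs only mimic the `result` / `source_documents` keys read by `process_llm_response`.

```python
from types import SimpleNamespace


def compare_chains(queries, chains):
    """Run every query through every chain and collect answers and sources."""
    rows = []
    for query in queries:
        for name, chain in chains.items():
            response = chain(query)
            sources = sorted({doc.metadata["source"]
                              for doc in response["source_documents"]})
            rows.append({"query": query,
                         "embeddings": name,
                         "answer": response["result"].strip(),
                         "sources": sources})
    return rows


def make_stub_chain(answer, source):
    """Build a fake chain that returns a fixed answer; stands in for a QA chain."""
    def chain(query):
        doc = SimpleNamespace(metadata={"source": source})
        return {"result": f" {answer}", "source_documents": [doc, doc]}
    return chain


rows = compare_chains(
    ["example question?"],
    {"Instructor": make_stub_chain("answer one", "a.pdf"),
     "OpenAI": make_stub_chain("answer two", "b.pdf")},
)
for row in rows:
    print(row["embeddings"], "->", row["answer"], row["sources"])
```

In this notebook the stubs would be replaced by the real chains, for example `{"Instructor": qa_chain_instrucEmbed, "OpenAI": qa_chain_openai}`, which puts both answers for each query side by side.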




