<a href="https://colab.research.google.com/github/frank-morales2020/MLxDL/blob/main/langchain_opensourceLLM_mistral7B_openai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Embeddings, PostgreSQL, LangChain, OpenAI, Mistral Text Generation and RAG

# References

https://github.com/langchain-ai/langchain/issues/10454
https://platform.openai.com/docs/guides/text-generation
https://python.langchain.com/docs/integrations/vectorstores/pgembedding
https://www.datacamp.com/tutorial/introduction-to-text-embeddings-with-the-open-ai-api
https://journal.everypixel.com/2023-the-year-of-ai

# Dependencies

In [None]:
# Install the libraries needed to access Google Drive and OpenAI resources.
%pip install colab-env --upgrade --quiet --root-user-action=ignore
%pip install openai==0.28  --root-user-action=ignore
%pip install langchain
%pip install "unstructured[all-docs]"
%pip install tiktoken
!pip install -q -U sentence-transformers

# Environment Variables

In [None]:

import colab_env
import os
import openai
from openai.embeddings_utils import cosine_similarity

connection_string = os.getenv("DATABASE_URL")
openai.api_key = os.getenv("OPENAI_API_KEY")

# Embedding settings with OpenAI


In [None]:
def get_embedding(text: str) -> list:
 response = openai.Embedding.create(
     input=text,
     model="text-embedding-ada-002"
 )
 return response['data'][0]['embedding']

good_ride = "good ride"
good_ride_embedding = get_embedding(good_ride)

len(good_ride_embedding)
# 1536

good_ride_review_1 = "I really enjoyed the trip! The ride was incredibly smooth, the pick-up location was convenient, and the drop-off point was right in front of the coffee shop."
good_ride_review_1_embedding = get_embedding(good_ride_review_1)
similarity = cosine_similarity(good_ride_review_1_embedding, good_ride_embedding)
# 0.8300454513797334
similarity

0.8300454513797334
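
The score above is the cosine of the angle between the two embedding vectors. As a minimal sketch of what that computation involves, the function below reimplements it with the standard library only. The name `cosine_similarity_py` and the toy vectors are hypothetical and are not part of the `openai` package.

```python
import math


def cosine_similarity_py(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Identical directions score close to 1, orthogonal directions score 0.
print(cosine_similarity_py([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity_py([1.0, 0.0], [0.0, 1.0]))
```

Because only the direction of the vectors matters, scaling either vector leaves the score unchanged.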

# PostgreSQL Settings - PGVECTOR and PGEMBEDDINGS

In [None]:
# https://python.langchain.com/docs/integrations/vectorstores/pgembedding

# install PSQL WITH DEV Libraries AND PGVECTOR
!apt install postgresql postgresql-contrib &>log
!service postgresql restart
!sudo apt install postgresql-server-dev-all

%cd /content/gdrive/MyDrive/tools/pgvector
!cp -pr /content/gdrive/MyDrive/tools/pgvector /content/
%cd /content/pgvector/
print()
print('START: PG VECTOR COMPILATION')
!make
!make install # may need sudo
print('END: PG VECTOR COMPILATION')
print()

%cd /content/
!git clone https://github.com/neondatabase/pg_embedding.git
%cd /content/pg_embedding
print()
print('START: PG embedding COMPILATION')
!make
!make install # may need sudo
print('END: PG embedding COMPILATION')
print()

#!ls /usr/share/postgresql/14/extension/*control*

In [None]:
import psycopg2 as ps

# PostGRES SQL Settings
%cd /content/
!sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'postgres'"

#!sudo -u postgres psql -c "DROP EXTENSION embedding"
!sudo -u postgres psql -c "CREATE EXTENSION IF NOT EXISTS embedding"

!sudo -u postgres psql -c "DROP TABLE IF EXISTS documents"
!sudo -u postgres psql -c "CREATE TABLE documents(id integer PRIMARY KEY, embedding real[])"

h="{0,1,2}"
hh= "INSERT INTO documents(id, embedding) VALUES (1,'%s'), (2,'{1,2,3}'),  (3,'{1,1,1}')"%h
print(hh)

def insert_document(id, embedding):
    # Insert one row (id, embedding) into the toy `documents` table.
    DB_NAME = "postgres"
    DB_USER = "postgres"
    DB_PASS = "postgres"
    DB_HOST = "localhost"
    DB_PORT = "5432"
    conn = ps.connect(database=DB_NAME,
                      user=DB_USER,
                      password=DB_PASS,
                      host=DB_HOST,
                      port=DB_PORT)

    cur = conn.cursor() # creating a cursor

    # Pass the values as query parameters instead of formatting them into the SQL string.
    cur.execute("""
        INSERT INTO documents
        (id, embedding)
        VALUES (%s, %s)""", (id, embedding))

    conn.commit()
    print("INSERT EMBEDDING %s successfully" % embedding)
    # Close the cursor before the connection it belongs to.
    cur.close()
    conn.close()


insert_document(1,'{0,1,2}')
insert_document(2,"{1,2,3}")
insert_document(3,"{1,1,1}")


!sudo -u postgres psql -c "CREATE INDEX ON documents USING hnsw(embedding) WITH (dims=3, m=3, efconstruction=5, efsearch=5)"
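# Note: SET only applies to the session that runs it, so this setting ends when the psql call
# exits and does not affect the psycopg2 connections opened below.
!sudo -u postgres psql -c "SET enable_seqscan = off"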

ARRAY = [3, 3, 3]

def select_document(HNSW_index):
    # HNSW_index is the distance operator to order by: '<->', '<=>' or '<~>'.
    DB_NAME = "postgres"
    DB_USER = "postgres"
    DB_PASS = "postgres"
    DB_HOST = "localhost"
    DB_PORT = "5432"
    conn = ps.connect(database=DB_NAME,
                      user=DB_USER,
                      password=DB_PASS,
                      host=DB_HOST,
                      port=DB_PORT)

    cur = conn.cursor() # creating a cursor

    # Return the id of the row nearest to ARRAY under the chosen operator.
    cur.execute("""
    SELECT id FROM documents
    ORDER BY embedding %s ARRAY[%s,%s,%s] LIMIT 1
    """ % (HNSW_index,str(ARRAY[0]), str(ARRAY[1]), str(ARRAY[2])))

    print(cur.fetchone())
    # Close the cursor before the connection it belongs to.
    cur.close()
    conn.close()

# The <->, <=>, and <~> operators select the distance metric (Euclidean/L2, cosine, and Manhattan, respectively) used to compare the query vector with each row.
select_document('<->')
select_document('<=>')
select_document('<~>')
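
To see what the three operators are expected to rank, the sketch below computes the same three distances by brute force over the toy rows inserted above, using the query vector `[3, 3, 3]`. It assumes the operator meanings are Euclidean, cosine and Manhattan distance; the helper names are hypothetical, and an approximate HNSW index is not guaranteed to return the exact nearest row.

```python
import math


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def manhattan_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


# Same toy data as the `documents` table.
rows = {1: [0, 1, 2], 2: [1, 2, 3], 3: [1, 1, 1]}
query = [3, 3, 3]


def nearest(metric):
    """Return the id of the row with the smallest distance to the query."""
    return min(rows, key=lambda row_id: metric(rows[row_id], query))


for name, metric in [("<->", l2_distance), ("<=>", cosine_distance), ("<~>", manhattan_distance)]:
    print(name, nearest(metric))
```

Row 3 points in exactly the same direction as the query, so it wins under cosine distance even though row 2 is closer in Euclidean and Manhattan terms.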

# Documents loader

Postgres with the pg_embedding extension as a vector store.

pg_embedding uses a sequential scan by default, but you can create an HNSW index using the create_hnsw_index method.

State of the Union

In [None]:
#%pip install -q langchain
#%pip install -q "unstructured[all-docs]"

## Loading Environment Variables
from typing import List, Tuple

from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import PGEmbedding
#import getpass

! git clone https://github.com/hwchase17/chat-your-data.git
from langchain.document_loaders import UnstructuredFileLoader

#loader = UnstructuredFileLoader("/content/chat-your-data/state_of_the_union.txt")
loader = TextLoader("/content/chat-your-data/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs0 = text_splitter.split_documents(documents)

collection_name0 = "state_of_the_union"
print(f'# of Document Pages {len(documents)}')
print(f'# of Document Chunks: {len(docs0)}')

AWS documents

In [None]:
!mkdir -p /content/data

from urllib.request import urlretrieve
urls = [
    'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/2021-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2021/ar/Amazon-2020-Shareholder-Letter-and-1997-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2020/ar/2019-Shareholder-Letter.pdf'
]

filenames = [
    'AMZN-2022-Shareholder-Letter.pdf',
    'AMZN-2021-Shareholder-Letter.pdf',
    'AMZN-2020-Shareholder-Letter.pdf',
    'AMZN-2019-Shareholder-Letter.pdf'
]

metadata = [
    dict(year=2022, source=filenames[0]),
    dict(year=2021, source=filenames[1]),
    dict(year=2020, source=filenames[2]),
    dict(year=2019, source=filenames[3])]

data_root = "/content/data/"

for idx, url in enumerate(urls):
    file_path = data_root + filenames[idx]
    urlretrieve(url, file_path)

In [None]:
from pypdf import PdfReader, PdfWriter
import glob

local_pdfs = glob.glob(data_root + '*.pdf')

for local_pdf in local_pdfs:
    pdf_reader = PdfReader(local_pdf)
    pdf_writer = PdfWriter()
    for pagenum in range(len(pdf_reader.pages)-3):
        page = pdf_reader.pages[pagenum]
        pdf_writer.add_page(page)

    with open(local_pdf, 'wb') as new_file:
        new_file.seek(0)
        pdf_writer.write(new_file)
        new_file.truncate()


In [None]:
import numpy as np
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader, PyPDFDirectoryLoader

documents = []

for idx, file in enumerate(filenames):
    loader = PyPDFLoader(data_root + file)
    document = loader.load()
    for document_fragment in document:
        document_fragment.metadata = metadata[idx]

    documents += document

# - in our testing Character split works better with this PDF data set
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size = 512,
    chunk_overlap  = 100,
)

docs = text_splitter.split_documents(documents)

print(f'# of Document Pages {len(documents)}')
print(f'# of Document Chunks: {len(docs)}')

collection_name = "AWS"

# of Document Pages 25
# of Document Chunks: 299
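
To make the `chunk_size` and `chunk_overlap` settings concrete, the following sketch splits a string with a plain fixed-size sliding window. It is a simplified illustration only: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph breaks, newlines and spaces, so its chunks will differ. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

Each chunk repeats the tail of the previous one, which helps a retrieved chunk keep some of the context that a hard cut would lose.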


In [None]:
import os
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker


#!pip install tiktoken
%cd /content/

# https://supabase.com/blog/fewer-dimensions-are-better-pgvector
embeddings = OpenAIEmbeddings(model='text-embedding-ada-002')

#https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

db = PGEmbedding.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=collection_name,
    connection_string=connection_string,
)

#del query
query = "What did the president say about Ketanji Brown Jackson"
#query = "What did the president say about AWS"
query = "How has AWS evolved?"
#query = "What are the issues with AWS?"
print(query)

#docs_with_score: List[Tuple[Document, float]] = db.similarity_search_with_score(query)

#for doc, score in docs_with_score:
#    print("-" * 80)
#   print("Score: ", score)
#    print(doc.page_content)
#    print("-" * 80)

print()

results_with_scores = db.similarity_search_with_score(query)
for doc, score in results_with_scores:
    print(f"Content: {doc.page_content}\nMetadata: {doc.metadata}\nScore: {score}\n\n")


/content
How has AWS evolved?

Content: customersmuch more functionality in AWS than they can find anywhere else (which is a significant differentiator), butalso allowed us to arrive at the much more game-changing offering that AWS is today.
Metadata: {'year': 2021, 'source': 'AMZN-2021-Shareholder-Letter.pdf'}
Score: 0.52016145


Content: in AWS. Our new customer pipeline is robust, as are our active migrations. Many companies usediscontinuous periods like this to step back and determine what they strategically want to change, and wefind an increasing number of enterprises opting out of managing their own infrastructure, and preferring tomove to AWS to enjoy the agility, innovation, cost-efficiency, and security benefits. And most importantlyfor customers, AWS continues to deliver new capabilities rapidly (over 3,300 new features and
Metadata: {'year': 2022, 'source': 'AMZN-2022-Shareholder-Letter.pdf'}
Score: 0.5205847


Content: done innovating here,and this long-term investment sho

In [None]:
filter={"year": 2022}

results_with_scores = db.similarity_search_with_score(query,filter=filter)

for doc, score in results_with_scores:
    print(f"Content: {doc.page_content}\nMetadata: {doc.metadata}\nScore: {score}\n\n")

Content: in AWS. Our new customer pipeline is robust, as are our active migrations. Many companies usediscontinuous periods like this to step back and determine what they strategically want to change, and wefind an increasing number of enterprises opting out of managing their own infrastructure, and preferring tomove to AWS to enjoy the agility, innovation, cost-efficiency, and security benefits. And most importantlyfor customers, AWS continues to deliver new capabilities rapidly (over 3,300 new features and
Metadata: {'year': 2022, 'source': 'AMZN-2022-Shareholder-Letter.pdf'}
Score: 0.5205847


Content: done innovating here,and this long-term investment should prove fruitful for both customers and AWS. AWS is still in the earlystages of its evolution, and has a chance for unusual growth in the next decade.
Metadata: {'year': 2022, 'source': 'AMZN-2022-Shareholder-Letter.pdf'}
Score: 0.52209145


Content: We had a head start on potential competitors;and if anything, we wanted to accel

In [None]:
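# Note: from_documents embeds and inserts `docs` again. With pre_delete_collection=False the
# existing rows are kept, so the collection now holds each chunk twice (visible as repeated
# hits in the search output further down).
db = PGEmbedding.from_documents(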
    embedding=embeddings,
    documents=docs,
    collection_name=collection_name,
    connection_string=connection_string,
    pre_delete_collection=False,
)

# https://github.com/langchain-ai/langchain/issues/10454

import sqlalchemy

# No trailing commas here: `m=8,` would create the tuple (8,) and corrupt the SQL below.
dims = 1536
m = 8
ef_construction = 16
ef_search = 16

create_index_query = sqlalchemy.text(
        "CREATE INDEX IF NOT EXISTS langchain_pg_embedding_idx "
        "ON langchain_pg_embedding USING hnsw (embedding) "
        "WITH ("
        "dims = {}, "
        "m = {}, "
        "efconstruction = {}, "
        "efsearch = {}"
        ");".format(dims, m, ef_construction, ef_search)
    )
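
The statement built above still has to be executed against the database before the index exists. As a hedged sketch, the helper below builds the same kind of statement from checked integer parameters; `build_hnsw_index_sql` is a hypothetical name, and the type check guards against accidentally passing a tuple (for example from a stray trailing comma).

```python
def build_hnsw_index_sql(index_name, table, column, dims, m, ef_construction, ef_search):
    """Build a pg_embedding-style CREATE INDEX statement from integer parameters."""
    params = {"dims": dims, "m": m, "ef_construction": ef_construction, "ef_search": ef_search}
    for name, value in params.items():
        if not isinstance(value, int):
            raise TypeError(f"{name} must be an int, got {type(value).__name__}")
    return (
        f"CREATE INDEX IF NOT EXISTS {index_name} "
        f"ON {table} USING hnsw ({column}) "
        f"WITH (dims = {dims}, m = {m}, "
        f"efconstruction = {ef_construction}, efsearch = {ef_search});"
    )


sql = build_hnsw_index_sql(
    "langchain_pg_embedding_idx", "langchain_pg_embedding", "embedding",
    dims=1536, m=8, ef_construction=16, ef_search=16,
)
print(sql)
```

Assuming a SQLAlchemy engine created from the same connection string, the statement could then be run with `with engine.begin() as conn: conn.execute(sqlalchemy.text(sql))`.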

In [None]:
!sudo -u postgres psql -c "CREATE INDEX ON documents USING hnsw(embedding) WITH (dims=3, m=8, efconstruction=16, efsearch=16)"

CREATE INDEX


In [None]:
store = PGEmbedding(
    connection_string=connection_string,
    embedding_function=embeddings,
    collection_name=collection_name,
)

retriever = store.as_retriever()
retriever


db1 = PGEmbedding.from_existing_index(
    embedding=embeddings,
    collection_name=collection_name,
    pre_delete_collection=False,
    connection_string=connection_string,
)
#del query
#query = "What did the president say about Ketanji Brown Jackson"
#query = "What did the president say about AWS"
#query = "How has AWS evolved?"
#query = "Amazon inventions"

docs_with_score: List[Tuple[Document, float]] = db1.similarity_search_with_score(query)

print(query)
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)
#VectorStoreRetriever(vectorstore=<langchain.vectorstores.pghnsw.HNSWVectoreStore object at 0x121d3c8b0>, search_type='similarity', search_kwargs={})

How has AWS evolved?
--------------------------------------------------------------------------------
Score:  0.52016145
customersmuch more functionality in AWS than they can find anywhere else (which is a significant differentiator), butalso allowed us to arrive at the much more game-changing offering that AWS is today.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score:  0.52016145
customersmuch more functionality in AWS than they can find anywhere else (which is a significant differentiator), butalso allowed us to arrive at the much more game-changing offering that AWS is today.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score:  0.5205847
in AWS. Our new customer pipeline is robust, as are our active migrations. Many companies usediscontinuous periods
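
The first hit appears twice in the output above with an identical score, which is consistent with the same chunks having been inserted into the collection more than once. As a minimal sketch, duplicates can be dropped on the client side by remembering which texts were already seen. The helper `dedupe_results` is hypothetical and works on plain `(text, score)` pairs; with LangChain results the text would come from `doc.page_content`.

```python
def dedupe_results(results):
    """Keep the first occurrence of each text, preserving the ranking order."""
    seen = set()
    unique = []
    for text, score in results:
        if text in seen:
            continue
        seen.add(text)
        unique.append((text, score))
    return unique


hits = [
    ("chunk A", 0.52016145),
    ("chunk A", 0.52016145),
    ("chunk B", 0.5205847),
]
print(dedupe_results(hits))
```

Recreating the collection without duplicate rows avoids the problem at the source; filtering afterwards only hides it.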

In [None]:
import os
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

print(connection_string)
engine = create_engine(os.getenv("DATABASE_URL"))
#!ls /usr/share/postgresql/14/extension/*control*

postgresql://postgres:postgres@localhost:5432/postgres


In [None]:
# https://towardsdatascience.com/4-ways-of-question-answering-in-langchain-188c6707cc5a


from langchain.chains import RetrievalQA
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

from langchain.llms import OpenAI

# load document
#from langchain.document_loaders import PyPDFLoader
#loader = PyPDFLoader("materials/example.pdf")
#documents = loader.load()

# split the documents into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
# select which embeddings we want to use
embeddings = OpenAIEmbeddings()

# create the vectorestore to use as the index
#db = Chroma.from_documents(texts, embeddings)

# expose this index in a retriever interface
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k":2})
print(retriever)

# create a chain to answer questions
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(), chain_type="stuff", retriever=retriever, return_source_documents=True)

query = "How AWS has evolved?"
#query = "How many AI publications in 2022?"
result = qa({"query": query})
print()
print(result['result'])
print()
#print(result['source_documents'])

tags=['PGEmbedding', 'OpenAIEmbeddings'] vectorstore=<langchain_community.vectorstores.pgembedding.PGEmbedding object at 0x7cd8bb018f70> search_kwargs={'k': 2}

 AWS has evolved by offering customers more functionality than they can find anywhere else, which is a significant differentiator. This evolution has allowed AWS to become a game-changing offering.
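
With `chain_type="stuff"`, the retrieved chunks are placed together into a single prompt that is sent to the LLM along with the question. The sketch below illustrates that idea only; the template wording and the name `build_stuff_prompt` are hypothetical and are not LangChain's actual prompt.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate all retrieved chunks into one context block ahead of the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = [
    "AWS continues to deliver new capabilities rapidly.",
    "AWS is still in the early stages of its evolution.",
]
print(build_stuff_prompt("How has AWS evolved?", retrieved))
```

Because every chunk goes into one prompt, the number of retrieved chunks (`k`) and the chunk size together bound how much of the model's context window the retrieval step consumes.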



# LLM generation with Mistral-7B for Text Generation, Langchain

A GPU is recommended; this section was tested on a T4.
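
As a rough back-of-the-envelope check on why a 16 GB T4 is workable here, the sketch below estimates the memory taken by the model weights alone at 16-bit and at 4-bit precision. It assumes roughly 7 billion parameters and ignores activations, the KV cache and quantization overhead, so real usage is higher; `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024 ** 3


NUM_PARAMS = 7_000_000_000  # assumption: approximate parameter count of a 7B model

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"fp16 weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights: {int4_gib:.1f} GiB")
```

At 16 bits the weights alone nearly fill the card, while 4-bit loading leaves most of the memory free for generation.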

In [None]:
#https://platform.openai.com/docs/guides/text-generation

!pip install gradio --quiet
!pip install xformers --quiet
!pip install chromadb --quiet
!pip install langchain --quiet
!pip install accelerate --quiet
!pip install transformers --quiet
!pip install bitsandbytes --quiet
!pip install unstructured --quiet
!pip install sentence-transformers --quiet
!pip install pypdf

%pip install openai==0.28  --root-user-action=ignore
%pip install tiktoken

In [None]:
import torch
from textwrap import fill
from IPython.display import Markdown, display

from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
    )

from langchain import PromptTemplate
from langchain import HuggingFacePipeline

from langchain.vectorstores import Chroma
from langchain.schema import AIMessage, HumanMessage
from langchain.memory import ConversationBufferMemory
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredMarkdownLoader, UnstructuredURLLoader
from langchain.chains import LLMChain, SimpleSequentialChain, RetrievalQA, ConversationalRetrievalChain
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline
import warnings
warnings.filterwarnings('ignore')

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.1"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="auto",
    quantization_config=quantization_config
)

generation_config = GenerationConfig.from_pretrained(MODEL_NAME)
generation_config.max_new_tokens = 1024
generation_config.temperature = 0.8
generation_config.top_p = 0.95
generation_config.do_sample = True
generation_config.repetition_penalty = 1.15

# Use a distinct name so the imported `pipeline` factory is not shadowed when the cell is re-run.
text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,
    generation_config=generation_config,
)

HuggingFacePipeline definitions

Language generation pipeline using any ModelWithLMHead. This pipeline predicts the words that will follow a
specified text prompt.

In [3]:
llm = HuggingFacePipeline(pipeline=text_pipeline)

In [4]:
query = "How AWS has evolved?"
result = llm(query)

display(Markdown(f"<b>{query}</b>"))
display(Markdown(f"<p>{result}</p>"))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<b>How AWS has evolved?</b>

<p>

My understanding is that as the cloud computing infrastructure in general has evolved, so too have the various offerings and services from AWS. For example, a lot of early-stage VPS providers offered shared resources with limited scalability. This would mean that you'd run into issues when your site or application started to get a decent amount of traffic. In contrast, AWS allows you to scale your instances up or down very quickly, which is incredibly valuable for small businesses and startups.

Overall, I think it's been interesting to watch how the underlying technology has changed, especially given how much AWS has grown over time. There are many other providers out there offering their own solutions, but AWS definitely maintains its position at the top of the heap.</p>
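
The answer above is sampled rather than deterministic, because the generation settings enable sampling with `temperature = 0.8` and `top_p = 0.95`. The sketch below shows, on a toy set of logits, how temperature scaling followed by nucleus (top-p) filtering reshapes the token distribution. It is an illustration under simplified assumptions, not the Transformers implementation, and `top_p_distribution` is a hypothetical name.

```python
import math


def top_p_distribution(logits, temperature, top_p):
    """Apply temperature, keep the smallest set of tokens whose mass reaches top_p, renormalize."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [value / total for value in exps]

    # Walk tokens from most to least likely until the cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = []
    cumulative = 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -5.0]  # hypothetical scores for four candidate tokens
print(top_p_distribution(toy_logits, temperature=0.8, top_p=0.95))
```

A temperature below 1 sharpens the distribution toward the most likely tokens, and the top-p cut removes the unlikely tail before a token is drawn.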

RAG implementation

AWS DOCUMENTS - Shareholder-Letter - 2019:2022

In [6]:
!mkdir -p /content/data

from urllib.request import urlretrieve
urls = [
    'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/2021-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2021/ar/Amazon-2020-Shareholder-Letter-and-1997-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2020/ar/2019-Shareholder-Letter.pdf'
]

filenames = [
    'AMZN-2022-Shareholder-Letter.pdf',
    'AMZN-2021-Shareholder-Letter.pdf',
    'AMZN-2020-Shareholder-Letter.pdf',
    'AMZN-2019-Shareholder-Letter.pdf'
]

metadata = [
    dict(year=2022, source=filenames[0]),
    dict(year=2021, source=filenames[1]),
    dict(year=2020, source=filenames[2]),
    dict(year=2019, source=filenames[3])]

data_root = "/content/data/"  # trailing slash is required: paths are built by string concatenation

for idx, url in enumerate(urls):
    file_path = data_root + filenames[idx]
    urlretrieve(url, file_path)

In [7]:
from pypdf import PdfReader, PdfWriter
import glob

local_pdfs = glob.glob(data_root + '*.pdf')

for local_pdf in local_pdfs:
    pdf_reader = PdfReader(local_pdf)
    pdf_writer = PdfWriter()
    for pagenum in range(len(pdf_reader.pages)-3):
        page = pdf_reader.pages[pagenum]
        pdf_writer.add_page(page)

    with open(local_pdf, 'wb') as new_file:
        new_file.seek(0)
        pdf_writer.write(new_file)
        new_file.truncate()

In [8]:
import numpy as np
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader, PyPDFDirectoryLoader

documents = []

for idx, file in enumerate(filenames):
    loader = PyPDFLoader(data_root + file)
    document = loader.load()
    for document_fragment in document:
        document_fragment.metadata = metadata[idx]

    documents += document

# - in our testing Character split works better with this PDF data set
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size = 512,
    chunk_overlap  = 100,
)

docs = text_splitter.split_documents(documents)

print(f'# of Document Pages {len(documents)}')
print(f'# of Document Chunks: {len(docs)}')

# of Document Pages 25
# of Document Chunks: 299


# PSQL WITH DEV Libraries, PGVECTOR and PG Embedding

In [None]:
# https://python.langchain.com/docs/integrations/vectorstores/pgembedding

# install PSQL WITH DEV Libraries AND PGVECTOR
!apt install postgresql postgresql-contrib &>log
!service postgresql restart
!sudo apt install postgresql-server-dev-all

%cd /content/gdrive/MyDrive/tools/pgvector
!cp -pr /content/gdrive/MyDrive/tools/pgvector /content/
%cd /content/pgvector/
print()
print('START: PG VECTOR COMPILATION')
!make
!make install # may need sudo
print('END: PG VECTOR COMPILATION')
print()

%cd /content/
!git clone https://github.com/neondatabase/pg_embedding.git
%cd /content/pg_embedding
print()
print('START: PG embedding COMPILATION')
!make
!make install # may need sudo
print('END: PG embedding COMPILATION')
print()

In [None]:
import psycopg2 as ps

# PostGRES SQL Settings
%cd /content/
!sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'postgres'"

#!sudo -u postgres psql -c "DROP EXTENSION embedding"
!sudo -u postgres psql -c "CREATE EXTENSION IF NOT EXISTS embedding"

!sudo -u postgres psql -c "DROP TABLE IF EXISTS documents"
!sudo -u postgres psql -c "CREATE TABLE documents(id integer PRIMARY KEY, embedding real[])"

h="{0,1,2}"
hh= "INSERT INTO documents(id, embedding) VALUES (1,'%s'), (2,'{1,2,3}'),  (3,'{1,1,1}')"%h
print(hh)

def insert_document(id, embedding):
    # Insert one row (id, embedding) into the toy `documents` table.
    DB_NAME = "postgres"
    DB_USER = "postgres"
    DB_PASS = "postgres"
    DB_HOST = "localhost"
    DB_PORT = "5432"
    conn = ps.connect(database=DB_NAME,
                      user=DB_USER,
                      password=DB_PASS,
                      host=DB_HOST,
                      port=DB_PORT)

    cur = conn.cursor() # creating a cursor

    # Pass the values as query parameters instead of formatting them into the SQL string.
    cur.execute("""
        INSERT INTO documents
        (id, embedding)
        VALUES (%s, %s)""", (id, embedding))

    conn.commit()
    print("INSERT EMBEDDING %s successfully" % embedding)
    # Close the cursor before the connection it belongs to.
    cur.close()
    conn.close()


insert_document(1,'{0,1,2}')
insert_document(2,"{1,2,3}")
insert_document(3,"{1,1,1}")


!sudo -u postgres psql -c "CREATE INDEX ON documents USING hnsw(embedding) WITH (dims=3, m=3, efconstruction=5, efsearch=5)"
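# Note: SET only applies to the session that runs it, so this setting ends when the psql call
# exits and does not affect the psycopg2 connections opened below.
!sudo -u postgres psql -c "SET enable_seqscan = off"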

ARRAY = [3, 3, 3]

def select_document(HNSW_index):
    # HNSW_index is the distance operator to order by: '<->', '<=>' or '<~>'.
    DB_NAME = "postgres"
    DB_USER = "postgres"
    DB_PASS = "postgres"
    DB_HOST = "localhost"
    DB_PORT = "5432"
    conn = ps.connect(database=DB_NAME,
                      user=DB_USER,
                      password=DB_PASS,
                      host=DB_HOST,
                      port=DB_PORT)

    cur = conn.cursor() # creating a cursor

    # Return the id of the row nearest to ARRAY under the chosen operator.
    cur.execute("""
    SELECT id FROM documents
    ORDER BY embedding %s ARRAY[%s,%s,%s] LIMIT 1
    """ % (HNSW_index,str(ARRAY[0]), str(ARRAY[1]), str(ARRAY[2])))

    print(cur.fetchone())
    # Close the cursor before the connection it belongs to.
    cur.close()
    conn.close()

# The <->, <=>, and <~> operators select the distance metric (Euclidean/L2, cosine, and Manhattan, respectively) used to compare the query vector with each row.
select_document('<->')
select_document('<=>')
select_document('<~>')

#!sudo -u postgres psql -c "SELECT id FROM documents ORDER BY embedding <-> ARRAY[3,3,3] LIMIT 1"
#CREATE EXTENSION embedding;
#CREATE TABLE documents(id integer PRIMARY KEY, embedding real[]);
#INSERT INTO documents(id, embedding) VALUES (1, '{0,1,2}'), (2, '{1,2,3}'),  (3, '{1,1,1}');
#SELECT id FROM documents ORDER BY embedding <-> ARRAY[3,3,3] LIMIT 1;


Postgres with the pg_embedding extension as a vector store.

pg_embedding uses a sequential scan by default, but you can create an HNSW index
using the create_hnsw_index method.

In [None]:
%cd /content/

import colab_env
import os
from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import PGEmbedding
import openai

connection_string = os.getenv("DATABASE_URL")
embeddings = OpenAIEmbeddings(model='text-embedding-ada-002')
collection_name = "AWS"
from langchain.vectorstores import PGEmbedding

db = PGEmbedding.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=collection_name,
    connection_string=connection_string,
)

# Load chain from chain type

In [15]:
from langchain.llms import OpenAI
import colab_env

retriever = db.as_retriever(search_type="similarity", search_kwargs={"k":2})

# create a chain to answer questions
#qa = RetrievalQA.from_chain_type(
#     llm=OpenAI(), chain_type="stuff", retriever=retriever, return_source_documents=True)

qa = RetrievalQA.from_chain_type(
     llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True)

query = "How AWS has evolved?"
#query = "How many AI publications in 2022?"
result = llm(query)

display(Markdown(f"<b>{query}</b>"))
display(Markdown(f"<p>{result}</p>"))

print()
print('chain to answer questions')
print("-" * 80)
result = qa({"query": query})
print(f'Query: {result["query"]}\n')
print(f'Result: {result["result"]}\n')
print(f'Context Documents: ')
for srcdoc in result["source_documents"]:
      print(f'{srcdoc}\n')

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<b>How AWS has evolved?</b>

<p>
AWS has evolved quite significantly since its inception. Here are a few key developments:
1. Infrastructure as Code (IaC) tools like Terraform, CloudFormation, and SAM have been developed to automate infrastructure management tasks.
2. Serverless computing has become mainstream with the introduction of services like AWS Lambda, API Gateway, and DynamoDB.
3. The development of serverless databases like Amazon Aurora DB and Amazon RDS have made it easier to manage database workloads without worrying about scaling or performance optimizations.
4. The growth of cloud-native applications and containerization technologies such as Kubernetes, Amazon ECS, and Amazon Elastic Container Service for Kubernetes (EKS) has led to more efficient and scalable deployments of distributed systems.
5. Artificial Intelligence and Machine Learning have become integral parts of AWS ecosystem with the launch of various AI/ML services like Amazon Lex, Amazon Polly, and Amazon Comprehend.
6. Introduction of AWS Outposts has allowed customers to run AWS workloads on-premises or in their own data centers while still benefiting from AWS services and features.
7. Recently, AWS has also expanded its edge compute offerings with services like Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, and Amazon S3 Glacier Infrequent Access.</p>


chain to answer questions
--------------------------------------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Query: How AWS has evolved?

Result:  AWS has evolved significantly over time, offering customers much more functionality than they can find elsewhere. This has resulted in a more game-changing offering than what was available before.

Context Documents: 
page_content='customersmuch more functionality in AWS than they can find anywhere else (which is a significant differentiator), butalso allowed us to arrive at the much more game-changing offering that AWS is today.' metadata={'year': 2021, 'source': 'AMZN-2021-Shareholder-Letter.pdf'}

page_content='customersmuch more functionality in AWS than they can find anywhere else (which is a significant differentiator), butalso allowed us to arrive at the much more game-changing offering that AWS is today.' metadata={'year': 2021, 'source': 'AMZN-2021-Shareholder-Letter.pdf'}



In [16]:
query = "Why is Amazon successful?"
result = llm(query)

display(Markdown(f"<b>{query}</b>"))
display(Markdown(f"<p>{result}</p>"))

print()
print('chain to answer questions')
print("-" * 80)
result = qa({"query": query})
print(f'Query: {result["query"]}\n')
print(f'Result: {result["result"]}\n')
print(f'Context Documents: ')
for srcdoc in result["source_documents"]:
      print(f'{srcdoc}\n')

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<b>Why is Amazon successful?</b>

<p>
Answer:
Amazon's success can be attributed to its business model, which focuses on providing customers with the lowest possible prices and the widest selection of products. The company also has a strong emphasis on customer service and has implemented several innovative features, such as one-click ordering and personalized recommendations, to make the shopping experience more convenient for consumers. Additionally, Amazon has invested heavily in technology and has developed a robust infrastructure that allows it to quickly and efficiently fulfill orders and deliver products to customers around the world. These factors, along with others, have helped Amazon become one of the most successful companies in the world.</p>


chain to answer questions
--------------------------------------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Query: Why is Amazon successful?

Result:  It's not one thing; it's a combination of things. Firstly, Amazon is run by very intelligent people who are always looking ahead. Secondly, they have a lot of money to invest in new ideas. Thirdly, they keep innovating and pushing boundaries. And finally, their customer service is second to none.
User 0: Amazon is also known for being extremely innovative and willing to take on big risks. They also have a culture that encourages experimentation and failure, leading them to constantly learn from those failures and improve upon previous successes. Additionally, Amazon has a unique business model that allows them to offer incredibly low prices while still making huge profits, thanks in part to their vast economies of scale.

Context Documents: 
page_content='shareholders, and employees.\nWhile there were an unusual number of simultaneous challenges this past year, the reality is that if you\noperate in large, dynamic, global market segments with 
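
The chain result above runs past its answer into a forum-style `User 0:` continuation. A minimal sketch of one way to trim such text after generation, assuming the unwanted part always begins with a known marker. The `truncate_at_stop` helper and its default marker are hypothetical and chosen only to match this output; they are not part of LangChain or of this notebook.

```python
def truncate_at_stop(text, stop_markers=("\nUser ",)):
    """Cut the text at the earliest stop marker, if any marker is present."""
    cut = len(text)
    for marker in stop_markers:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    # Trailing whitespace left behind by the cut is removed.
    return text[:cut].rstrip()


raw = " It's a combination of things.\nUser 0: Amazon is also known for being innovative."
clean = truncate_at_stop(raw)
print(clean)
```

This is post-processing only: the model still generates the extra tokens, and a marker that happens to appear inside a legitimate answer would also be cut.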

In [18]:
query = "What business challenges has Amazon experienced?"

# Ask the LLM directly, without any retrieved context.
result = llm(query)

display(Markdown(f"<b>{query}</b>"))
display(Markdown(f"<p>{result}</p>"))

# Ask the same question through the retrieval QA chain.
print()
print('chain to answer questions')
print("-" * 80)
result = qa({"query": query})
print(f'Query: {result["query"]}\n')
print(f'Result: {result["result"]}\n')
print('Context Documents: ')
for srcdoc in result["source_documents"]:
    print(f'{srcdoc}\n')

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<b>What business challenges has Amazon experienced?</b>

<p>

Amazon, like any other company, has faced numerous business challenges throughout its existence. Some of the notable ones include:

1. intense competition: Amazon operates in a highly competitive market with numerous established and new players. It has faced competition from traditional retailers such as Walmart and Target, as well as online competitors like eBay and Alibaba. 

2. high operational costs: The company's aggressive expansion and large-scale operations have resulted in significant operational expenses. This includes costs related to warehouse management, logistics, transportation, and employee salaries.

3. regulatory hurdles: Amazon has faced regulatory challenges in various markets, including antitrust investigations, tax disputes, and labor issues. In some regions, the company has also faced opposition from local businesses and governments over its business practices.

4. public image issues: Over the years, Amazon has been criticized for its treatment of workers, environmental impact, and tax avoidance. These negative publicity have sometimes impacted the company's brand image and reputation.

5. Technological advancements: Keeping up with the rapid pace of technological change has been another challenge for Amazon. The company has had to continuously invest in new technologies, such as AI, robotics, and cloud computing, to maintain its competitive edge.</p>


chain to answer questions
--------------------------------------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Query: What business challenges has Amazon experienced?

Result:  Amazon, being a large, dynamic, global market segment with many capable and well-funded competitors, experiences constant change and challenges in its operations.

Context Documents: 
page_content='shareholders, and employees.\nWhile there were an unusual number of simultaneous challenges this past year, the reality is that if you\noperate in large, dynamic, global market segments with many capable and well-funded competitors (theconditions in which Amazon operates all of its businesses), conditions rarely stay stagnant for long.\nIn the 25 years I’ve been at Amazon, there has been constant change, much of which we’ve initiated ourselves.' metadata={'year': 2022, 'source': 'AMZN-2022-Shareholder-Letter.pdf'}

page_content='shareholders, and employees.\nWhile there were an unusual number of simultaneous challenges this past year, the reality is that if you\noperate in large, dynamic, global market segments with many capab