<a href="https://colab.research.google.com/github/frank-morales2020/MLxDL/blob/main/GPT_LANGCHAIN_POSTGRESQL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DEPENDENCIES

In [None]:
#added by Frank Morales(FM) 22/02/2024
%pip install openai  --root-user-action=ignore
!pip install llama_index phoenix pyvis network
!pip install llama_hub
%pip install colab-env --upgrade --quiet --root-user-action=ignore
#!pip install typing_extensions

!pip install langchain --quiet
!pip install accelerate --quiet
!pip install transformers --quiet
!pip install bitsandbytes --quiet

# CPU CONFIGURATION

In [None]:
!git clone https://github.com/ultralytics/yolov5  # clone
%cd yolov5
%pip install -qr requirements.txt comet_ml

In [3]:
import torch
import utils
display = utils.notebook_init()  # checks

YOLOv5 🚀 v7.0-284-g95ebf68f Python-3.10.12 torch-2.1.0+cu121 CPU


Setup complete ✅ (2 CPUs, 12.7 GB RAM, 27.6/225.8 GB disk)


# DATA LOADER - TEXT FILES

State of the Union

In [4]:

from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

%cd /content/

!git clone https://github.com/hwchase17/chat-your-data.git
from langchain.document_loaders import UnstructuredFileLoader

#loader = UnstructuredFileLoader("/content/chat-your-data/state_of_the_union.txt")
loader = TextLoader("/content/chat-your-data/state_of_the_union.txt")
#/content/chat-your-data/state_of_the_union.txt
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs2 = text_splitter.split_documents(documents)

collection_name0 = "state_of_the_union"
print(f'# of Document Pages {len(documents)}')
print(f'# of Document Chunks: {len(docs2)}')

/content
Cloning into 'chat-your-data'...
remote: Enumerating objects: 62, done.
remote: Counting objects: 100% (28/28), done.
remote: Compressing objects: 100% (15/15), done.
remote: Total 62 (delta 17), reused 15 (delta 13), pack-reused 34
Receiving objects: 100% (62/62), 24.22 MiB | 31.20 MiB/s, done.
Resolving deltas: 100% (23/23), done.
# of Document Pages 1
# of Document Chunks: 42


Paul Graham Essay

In [5]:
import colab_env
import openai
import os
openai.api_key = os.getenv("OPENAI_API_KEY")

%cd /content/
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O pg_essay.txt

Mounted at /content/gdrive
/content
--2024-02-24 13:14:55--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘pg_essay.txt’


2024-02-24 13:14:55 (5.76 MB/s) - ‘pg_essay.txt’ saved [75042/75042]



# POSTGRESQL

In [None]:
# https://python.langchain.com/docs/integrations/vectorstores/pgembedding

# install PSQL WITH DEV Libraries AND PGVECTOR
!apt install postgresql postgresql-contrib &>log
!service postgresql restart
!sudo apt install postgresql-server-dev-all

In [None]:
%cd /content/gdrive/MyDrive/tools/pgvector
!cp -pr /content/gdrive/MyDrive/tools/pgvector /content/
%cd /content/pgvector/
print()
print('START: PG VECTOR COMPILATION')
!make
!make install # may need sudo
print('END: PG VECTOR COMPILATION')
print()

%cd /content/
!git clone https://github.com/neondatabase/pg_embedding.git
%cd /content/pg_embedding
print()
print('START: PG embedding COMPILATION')
!make
!make install # may need sudo
print('END: PG embedding COMPILATION')
print()

#!ls /usr/share/postgresql/14/extension/*control*

In [8]:
import psycopg2 as ps

# PostgreSQL settings
%cd /content/
!sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'postgres'"

#!sudo -u postgres psql -c "DROP EXTENSION embedding"
!sudo -u postgres psql -c "CREATE EXTENSION embedding"

!sudo -u postgres psql -c "DROP TABLE documents"
!sudo -u postgres psql -c "CREATE TABLE documents(id integer PRIMARY KEY, embedding real[])"

h="{0,1,2}"
hh= "INSERT INTO documents(id, embedding) VALUES (1,'%s'), (2,'{1,2,3}'),  (3,'{1,1,1}')"%h
print(hh)

def insert_document(id, embedding):
    """Insert one (id, embedding) row into the documents table."""
    DB_NAME = "postgres"
    DB_USER = "postgres"
    DB_PASS = "postgres"
    DB_HOST = "localhost"
    DB_PORT = "5432"
    conn = ps.connect(database=DB_NAME,
                      user=DB_USER,
                      password=DB_PASS,
                      host=DB_HOST,
                      port=DB_PORT)

    cur = conn.cursor() # creating a cursor

    # Pass the values as query parameters instead of formatting them into the SQL string.
    cur.execute(
        "INSERT INTO documents (id, embedding) VALUES (%s, %s)",
        (id, embedding))

    conn.commit()
    print("INSERT EMBEDDING %s successfully" % embedding)
    cur.close()  # close the cursor before its connection
    conn.close()


insert_document(1,'{0,1,2}')
insert_document(2,"{1,2,3}")
insert_document(3,"{1,1,1}")

!sudo -u postgres psql -c "CREATE INDEX ON documents USING hnsw(embedding) WITH (dims=3, m=8, efconstruction=16, efsearch=16)"
#!sudo -u postgres psql -c "CREATE INDEX ON documents USING hnsw(embedding) WITH (dims=3, m=3, efconstruction=5, efsearch=5)"
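# Note: SET is session-scoped, so this only affects this single psql invocation, not the psycopg2 connections opened below.
!sudo -u postgres psql -c "SET enable_seqscan = off"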

ARRAY = [3, 3, 3]

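def select_document(distance_operator):
    """Print the id of the row nearest to ARRAY under the given distance operator."""
    # Operators cannot be bound as query parameters, so only accept known ones.
    if distance_operator not in ('<->', '<=>', '<~>'):
        raise ValueError("unsupported distance operator: %s" % distance_operator)
    DB_NAME = "postgres"
    DB_USER = "postgres"
    DB_PASS = "postgres"
    DB_HOST = "localhost"
    DB_PORT = "5432"
    conn = ps.connect(database=DB_NAME,
                      user=DB_USER,
                      password=DB_PASS,
                      host=DB_HOST,
                      port=DB_PORT)

    cur = conn.cursor() # creating a cursor

    # The query vector components are bound as parameters.
    cur.execute(
        "SELECT id FROM documents ORDER BY embedding " + distance_operator
        + " ARRAY[%s,%s,%s] LIMIT 1",
        tuple(ARRAY))

    print(cur.fetchone())
    cur.close()  # close the cursor before its connection
    conn.close()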

# pg_embedding distance operators: <-> is Euclidean (L2), <=> is cosine, and <~> is Manhattan (L1).
# Each one computes the distance between the query vector and every row's embedding.
select_document('<->')
select_document('<=>')
select_document('<~>')

/content
ALTER ROLE
CREATE EXTENSION
ERROR:  table "documents" does not exist
CREATE TABLE
INSERT INTO documents(id, embedding) VALUES (1,'{0,1,2}'), (2,'{1,2,3}'),  (3,'{1,1,1}')
INSERT EMBEDDING {0,1,2} successfully
INSERT EMBEDDING {1,2,3} successfully
INSERT EMBEDDING {1,1,1} successfully
CREATE INDEX
SET
(2,)
(3,)
(2,)
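
The three ids printed above can be checked by hand. The following is a minimal pure-Python sketch, assuming that `<->` is Euclidean (L2) distance, `<=>` is cosine distance, and `<~>` is Manhattan (L1) distance, as the pg_embedding README describes them. The helper names are illustrative only and are not part of pg_embedding or psycopg2.

```python
import math

# The three toy rows inserted into the documents table, keyed by id.
rows = {1: [0.0, 1.0, 2.0], 2: [1.0, 2.0, 3.0], 3: [1.0, 1.0, 1.0]}
query = [3.0, 3.0, 3.0]  # same values as ARRAY in the cell above

def l2(a, b):
    # Euclidean distance (assumed meaning of <->)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    # Cosine distance = 1 - cosine similarity (assumed meaning of <=>)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def manhattan(a, b):
    # Manhattan distance (assumed meaning of <~>)
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest(metric):
    # Brute-force nearest neighbour: the id whose row minimises the metric.
    return min(rows, key=lambda row_id: metric(rows[row_id], query))

nearest_ids = {"<->": nearest(l2), "<=>": nearest(cosine), "<~>": nearest(manhattan)}
print(nearest_ids)
```

Row 3 wins under cosine distance because `{1,1,1}` points in exactly the same direction as the query, while row 2 is closest in both L2 and L1 terms.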


# DATA LOADER - PDF FILES

AMAZON Shareholder Letters

In [9]:
!mkdir -p /content/data

from urllib.request import urlretrieve
urls = [
    'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/2021-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2021/ar/Amazon-2020-Shareholder-Letter-and-1997-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2020/ar/2019-Shareholder-Letter.pdf'
]

filenames = [
    'AMZN-2022-Shareholder-Letter.pdf',
    'AMZN-2021-Shareholder-Letter.pdf',
    'AMZN-2020-Shareholder-Letter.pdf',
    'AMZN-2019-Shareholder-Letter.pdf'
]

metadata = [
    dict(year=2022, source=filenames[0]),
    dict(year=2021, source=filenames[1]),
    dict(year=2020, source=filenames[2]),
    dict(year=2019, source=filenames[3])]

data_root = "/content/data/"

for idx, url in enumerate(urls):
    file_path = data_root + filenames[idx]
    urlretrieve(url, file_path)

In [10]:
from pypdf import PdfReader, PdfWriter
import glob

local_pdfs = glob.glob(data_root + '*.pdf')

for local_pdf in local_pdfs:
    pdf_reader = PdfReader(local_pdf)
    pdf_writer = PdfWriter()
    # Copy every page except the last three.
    for pagenum in range(len(pdf_reader.pages)-3):
        page = pdf_reader.pages[pagenum]
        pdf_writer.add_page(page)

    # Overwrite the original file; mode 'wb' already truncates it.
    with open(local_pdf, 'wb') as new_file:
        pdf_writer.write(new_file)

In [11]:
import numpy as np
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader, PyPDFDirectoryLoader

documents = []

for idx, file in enumerate(filenames):
    loader = PyPDFLoader(data_root + file)
    document = loader.load()
    for document_fragment in document:
        document_fragment.metadata = metadata[idx]

    documents += document

# - in our testing Character split works better with this PDF data set
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size = 512,
    chunk_overlap  = 100,
)

#text_splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=100)
### for the DB embedding
docs = text_splitter.split_documents(documents)

print(f'# of Document Pages {len(documents)}')
print(f'# of Document Chunks: {len(docs)}')

collection_name = "AWS"

# of Document Pages 25
# of Document Chunks: 299
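
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks), and `sliding_chunks` is a hypothetical helper. It only illustrates that consecutive chunks share `chunk_overlap` characters.

```python
def sliding_chunks(text, chunk_size=512, chunk_overlap=100):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

# Synthetic 1200-character sample (not taken from the shareholder letters).
sample = "".join(chr(ord("a") + i % 26) for i in range(1200))
chunks = sliding_chunks(sample)
print(len(chunks), [len(c) for c in chunks])
print(chunks[0][-100:] == chunks[1][:100])
```

A larger overlap reduces the chance that a sentence is cut off from its context at a chunk boundary, at the cost of storing and embedding more text.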


# OPENAI - SETTINGS

In [12]:
import warnings
warnings.filterwarnings('ignore')

import colab_env
import openai
import os
openai.api_key = os.getenv("OPENAI_API_KEY")

from openai import OpenAI
client = OpenAI()

In [14]:
def gpt_reponse(query):
  response = client.chat.completions.create(
    model="gpt-4",
    #model="gpt-3.5-turbo"
    #response_format={ "type": "json_object" },
    messages=[
      #{"role": "system", "content": "You are a helpful assistant designed to output JSON."},
      {"role": "system", "content": "You are a helpful assistant designed to output text."},
      {"role": "user", "content": query}
    ]
  )

  return response

In [15]:
query = "Who won the world series in 2009 and who lost, explained?, who were the managers?"
response=gpt_reponse(query)

print()
print("-" * 80)
print('Question: %s'%query)
print("-" * 80)
print('Answer: %s'%response.choices[0].message.content)
print("-" * 80)


--------------------------------------------------------------------------------
Question: Who won the world series in 2009 and who lost, explained?, who were the managers?
--------------------------------------------------------------------------------
Answer: The 2009 World Series was won by the New York Yankees and the team that lost was the Philadelphia Phillies. The series concluded in six games with the Yankees winning four games to the Phillies' two.

The manager for the New York Yankees in 2009 was Joe Girardi. This was Girardi's second season as the Yankees' manager and this victory marked his first World Series win as a manager.

On the other hand, the Philadelphia Phillies were managed by Charlie Manuel. 2009 represented Manuel's fifth year as the Phillies' manager and he had led the team to victory in the previous World Series in 2008.
--------------------------------------------------------------------------------


# GPT4 MODEL - EXAMPLE

In [85]:
query = "How AWS has evolved?"
response=gpt_reponse(query)
print("-" * 80)
print('Question: %s'%query)
print("-" * 80)
print('Answer: %s'%response.choices[0].message.content)
print("-" * 80)

--------------------------------------------------------------------------------
Question: How AWS has evolved?
--------------------------------------------------------------------------------
Answer: Amazon Web Services (AWS), a subsidiary of Amazon.com, launched in 2006. Since then, it has continuously evolved and expanded its services. Here is a brief overview:

1. Inception (2002-2006): AWS platform, born out of Amazon's need to scale operations and its retail application, was initially a collection of tools and services for developers. The official launch of AWS occurred in 2006.

2. The Birth of EC2 and S3 (2006): Arguably the two most well-known AWS services, EC2 (Elastic Compute Cloud) and S3 (Simple Storage Service) were launched, marking the birth of cloud infrastructure services.

3. Service Expansion (2007-2010): AWS launched various services like SimpleDB, Elastic Block Store (EBS), Content Delivery Network (CDN), and more.

4. Global Expansion (2010-2011): AWS began its g

In [86]:
query = "I bought an ice cream for 6 kids. Each cone was $1.25 and I paid with a $10 bill. How many dollars did I get back? Explain first before answering."
response=gpt_reponse(query)

print()
print("-" * 80)
print('Question: %s'%query)
print("-" * 80)
print('Answer: %s'%response.choices[0].message.content)
print("-" * 80)


--------------------------------------------------------------------------------
Question: I bought an ice cream for 6 kids. Each cone was $1.25 and I paid with a $10 bill. How many dollars did I get back? Explain first before answering.
--------------------------------------------------------------------------------
Answer: First, you have to calculate the total cost of the ice cream cones for the 6 kids. Given that each cone costs $1.25, you simply multiply this price by the number of kids, which is 6. So, $1.25 x 6 equals $7.50. This is the total cost for the ice cream cones.

Next, we subtract the total cost of the ice cream from the amount you paid with which was a $10 bill. So, $10 - $7.50 equals $2.50.

Therefore, you received $2.50 back.
--------------------------------------------------------------------------------
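
The arithmetic in the model's answer can be verified directly. This is a small check using `Decimal` to avoid floating-point rounding on currency values; the variable names are illustrative.

```python
from decimal import Decimal

cones = 6                      # one cone per kid
price = Decimal("1.25")        # price per cone in dollars
paid = Decimal("10")           # the $10 bill

total = cones * price          # total cost of the cones
change = paid - total          # money returned
print(total, change)
```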


# EMBEDDING - WITH PDF FILES

In [17]:
# 20x faster than pgvector: introducing pg_embedding extension for vector search in Postgres and LangChain
# https://neon.tech/blog/pg-embedding-extension-for-vector-search

#ADDED By FM 22/02/2024

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import PGEmbedding

# https://supabase.com/blog/fewer-dimensions-are-better-pgvector
embeddings = OpenAIEmbeddings(model='text-embedding-ada-002')

collection_name='AWS'
connection_string = os.getenv("DATABASE_URL")

import llama_index.core.readers as readers
reader = readers.SimpleDirectoryReader(input_files=["/content/pg_essay.txt"])
## for the RAG
#docs = reader.load_data()

### FOR PDF FILES
db = PGEmbedding.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=collection_name,
    connection_string=connection_string,
)

#db.create_hnsw_index(dims = 1536, m = 8, ef_construction = 16, ef_search = 16)
#!sudo -u postgres psql -c "CREATE INDEX ON documents USING hnsw(embedding) WITH (dims=3, m=8, efconstruction=16, efsearch=16)"


In [18]:
#ADDED By FM 22/02/2024

from typing import List, Tuple
from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import PGEmbedding

query='AI'

docs_with_score: List[Tuple[Document, float]] = db.similarity_search_with_score(query)

print()
print(query)
print()

for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)


AI

--------------------------------------------------------------------------------
Score:  0.6194219
drones for Prime Air,to Alexa, to the many machine learning services AWS offers (where AWS has the broadest machine learningfunctionality and customer base of any cloud provider). More recently, a newer form of machine learning,called Generative AI, has burst onto the scene and promises to significantly accelerate machine learningadoption. Generative AI is based on very Large Language Models (trained on up to hundreds of billionsof parameters, and growing), across expansive datasets, and has radically
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score:  0.62688583
assistantlike Alexa (launched in 2014) that you could use to access entertainment, control your smart home, shop,and retrieve all sorts of information.
--------------------------------------------------------

# LANG CHAIN

In [19]:
import torch
from textwrap import fill
from IPython.display import Markdown, display

from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
    )

from langchain import PromptTemplate
from langchain import HuggingFacePipeline

from langchain.vectorstores import Chroma
from langchain.schema import AIMessage, HumanMessage
from langchain.memory import ConversationBufferMemory
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredMarkdownLoader, UnstructuredURLLoader
from langchain.chains import LLMChain, SimpleSequentialChain, RetrievalQA, ConversationalRetrievalChain
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline
import warnings
warnings.filterwarnings('ignore')

**Create a chain to answer questions**

In [37]:
openai.__version__

'1.12.0'

In [55]:
from langchain.llms import OpenAI

retriever = db.as_retriever(search_type="similarity", search_kwargs={"k":2})

# create a chain to answer questions
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(model_name='gpt-3.5-turbo-instruct', max_tokens=1512, temperature=0.9),
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)
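
With `chain_type="stuff"`, the retrieved chunks are concatenated into a single prompt that is sent to the LLM in one call. The sketch below illustrates that idea in plain Python. The prompt wording and the `build_stuff_prompt` helper are assumptions for illustration, not LangChain's actual template or API, and the two strings are shortened stand-ins for retrieved chunks.

```python
def build_stuff_prompt(question, chunks):
    """Place every retrieved chunk into one prompt ahead of the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Stand-ins for the k=2 chunks a retriever might return (hypothetical values).
retrieved = [
    "AWS is now an $85B annual revenue run rate business.",
    "We made the long-term decision to continue investing in AWS.",
]
prompt = build_stuff_prompt("How AWS has evolved?", retrieved)
print(prompt)
```

Because everything goes into one prompt, `k` times the chunk size has to fit within the model's context window alongside the question and the answer budget (`max_tokens`).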


**Model answers questions**

In [81]:
#MODEL	DESCRIPTION	CONTEXT WINDOW	TRAINING DATA
#gpt-3.5-turbo-0125
#gpt-3.5-turbo-instruct
# https://platform.openai.com/docs/models/overview
# https://pypi.org/project/openai/

from langchain.llms import OpenAI


#llm=OpenAI(model='gpt-4',max_tokens=1512,temperature=0.9)
llm=OpenAI(model='gpt-3.5-turbo-instruct',max_tokens=1512,temperature=0.9)


#from langchain_community.llms import OpenAI as OpenAIv1
#llm = OpenAIv1(model_name="gpt-4")


print()
print("-" * 80)
query = "How AWS has evolved?"
#query = "How many AI publications in 2022?"
result = llm(query)
display(Markdown(f"<p>{query}</p>"))
print("-" * 80)
display(Markdown(f"<p>{result}</p>"))


--------------------------------------------------------------------------------


<p>How AWS has evolved?</p>

--------------------------------------------------------------------------------


<p>

1. Cloud Computing: AWS revolutionized the computing industry by introducing the concept of cloud computing. With AWS, businesses can easily access computing resources as needed, rather than investing in expensive and complex on-premises infrastructure.

2. Expanding Services: From its humble beginnings as a simple storage service, AWS has expanded its offerings to include a wide range of services, including compute, storage, databases, networking, analytics, machine learning, Internet of Things (IoT), security, and more.

3. Global Infrastructure: AWS has massively expanded its global infrastructure, with data centers in over 25 regions around the world, enabling businesses to easily reach their customers wherever they may be located.

4. Constant Innovation: AWS is constantly innovating and releasing new services and features, such as serverless computing, containers, artificial intelligence, and more. This allows businesses to stay ahead of the curve and adapt to changing market needs.

5. Hybrid and Multi-cloud Capabilities: To meet the diverse needs of its customers, AWS offers hybrid and multi-cloud capabilities, allowing businesses to seamlessly integrate on-premises infrastructure with AWS cloud services.

6. Cost Optimization: AWS has introduced several cost-saving features, such as auto-scaling, reserved instances, and spot instances, helping businesses reduce their overall IT costs.

7. Enterprise Focus: AWS has made significant efforts to cater to the needs of large enterprises, with services such as AWS Enterprise Support, AWS Managed Services, and AWS Control Tower, making it easier for businesses to adopt and manage cloud services at scale.

8. Educating the Market: AWS offers a variety of training and certification programs for individuals and businesses to learn and build expertise on its services. This has helped to create a broad base of skilled professionals who are well-versed in AWS technologies.

9. Partner Ecosystem: AWS has built a vast partner ecosystem, including Independent Software Vendors (ISVs), System Integrators (SIs), Managed Service Providers (MSPs), and Consulting Partners, offering businesses a wide range of options to meet their specific needs.

10. Impact on other Cloud Providers: AWS's success has not only revolutionized the cloud computing market, but it has also pushed other cloud providers to innovate and improve their services, benefiting customers across the industry.</p>

**Chain answers questions**

In [56]:
print()
#print('chain to answer questions')
print("-" * 80)
print()
result = qa({"query": query})
print(f'Query: {result["query"]}\n')
print(f'Result: {result["result"]}\n')
print(f'Context Documents: ')
for srcdoc in result["source_documents"]:
      print(f'{srcdoc}\n')
print("-" * 80)


--------------------------------------------------------------------------------

Query: How AWS has evolved?

Result:  AWS has evolved into an $85B annual revenue run rate business with strong profitability that has transformed how customers manage their technology infrastructure. 

Context Documents: 
page_content='customersmuch more functionality in AWS than they can find anywhere else (which is a significant differentiator), butalso allowed us to arrive at the much more game-changing offering that AWS is today.' metadata={'year': 2021, 'source': 'AMZN-2021-Shareholder-Letter.pdf'}

page_content='We had a head start on potential competitors;and if anything, we wanted to accelerate our pace of innovation. We made the long-term decision tocontinue investing in AWS. Fifteen years later, AWS is now an $85B annual revenue run rate business, withstrong profitability, that has transformed how customers from start-ups to multinational companies to publicsector organizations manage their te