<a href="https://colab.research.google.com/github/frank-morales2020/MLxDL/blob/main/GPT_LANGCHAIN_POSTGRESQL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DEPENDENCIES

In [46]:
!nvidia-smi

Wed May  1 11:01:16 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   57C    P8              17W /  72W |      4MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
#added by Frank Morales(FM) 22/02/2024
%pip install openai  --root-user-action=ignore
!pip install llama_index phoenix pyvis network
!pip install llama_hub
%pip install colab-env --upgrade --quiet --root-user-action=ignore
#!pip install typing_extensions

!pip install langchain accelerate transformers bitsandbytes --quiet

# CPU CONFIGURATION

In [None]:
!git clone https://github.com/ultralytics/yolov5  # clone
%cd yolov5
%pip install -qr requirements.txt comet_ml

In [None]:
import torch
import utils
display = utils.notebook_init()  # checks

# DATA LOADER - TEXT FILES

State of the union

In [None]:
from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

%cd /content/

!git clone https://github.com/hwchase17/chat-your-data.git
from langchain.document_loaders import UnstructuredFileLoader

#loader = UnstructuredFileLoader("/content/chat-your-data/state_of_the_union.txt")
loader = TextLoader("/content/chat-your-data/state_of_the_union.txt")
#/content/chat-your-data/state_of_the_union.txt
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs2 = text_splitter.split_documents(documents)

collection_name0 = "state_of_the_union"
print(f'# of Document Pages {len(documents)}')
print(f'# of Document Chunks: {len(docs2)}')
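
A rough way to picture what a separator-based splitter with `chunk_overlap=0` does: split the text on blank lines, then greedily pack neighbouring pieces until the next one would exceed `chunk_size`. The sketch below is a simplified illustration under that assumption, not LangChain's implementation; `merge_paragraphs` and the sample text are hypothetical.

```python
def merge_paragraphs(text: str, chunk_size: int, separator: str = "\n\n") -> list[str]:
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters."""
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # The piece fits, or it starts a new chunk (an oversized piece is kept whole).
            current = candidate
        else:
            # The piece does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "alpha beta\n\ngamma\n\ndelta epsilon zeta\n\neta"
chunks = merge_paragraphs(sample, chunk_size=20)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

With `chunk_size=20`, the first two paragraphs share a chunk and the remaining two each get their own.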

Paul Graham Essay

In [None]:
!git clone https://github.com/dbredvick/paul-graham-to-kindle.git

In [None]:
import colab_env
import openai
import os
openai.api_key = os.getenv("OPENAI_API_KEY")


import llama_index.core.readers as readers
reader = readers.SimpleDirectoryReader(input_files=["/content/pg_essay.txt"])
## for the RAG
docs0 = reader.load_data()

collection_name0 = "pg_essay"
print(f'# of Document Pages {len(docs0)}')
print(f'# of Document Chunks: {len(docs0)}')


print()
print()

loader = TextLoader("/content/pg_essay.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
### for the DB embedding
docs0 = text_splitter.split_documents(documents)

collection_name0 = "pg_essay"
print(f'# of Document Pages {len(documents)}')
print(f'# of Document Chunks: {len(docs0)}')



# POSTGRESQL

In [None]:
# https://python.langchain.com/docs/integrations/vectorstores/pgembedding

# install PSQL WITH DEV Libraries AND PGVECTOR
!apt install postgresql postgresql-contrib &>log
!service postgresql restart
!sudo apt install postgresql-server-dev-all

In [None]:
%cd /content/gdrive/MyDrive/tools/pgvector
!cp -pr /content/gdrive/MyDrive/tools/pgvector /content/
%cd /content/pgvector/
print()
print('START: PG VECTOR COMPILATION')
!make
!make install # may need sudo
print('END: PG VECTOR COMPILATION')
print()

%cd /content/
!git clone https://github.com/neondatabase/pg_embedding.git
%cd /content/pg_embedding
print()
print('START: PG embedding COMPILATION')
!make
!make install # may need sudo
print('END: PG embedding COMPILATION')
print()

#!ls /usr/share/postgresql/14/extension/*control*

In [None]:
import psycopg2 as ps

# PostGRES SQL Settings
%cd /content/
!sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'postgres'"

#!sudo -u postgres psql -c "DROP EXTENSION embedding"
!sudo -u postgres psql -c "CREATE EXTENSION IF NOT EXISTS embedding"

# Recreate the demo table; IF EXISTS avoids an error on the first run
!sudo -u postgres psql -c "DROP TABLE IF EXISTS documents"
!sudo -u postgres psql -c "CREATE TABLE documents(id integer PRIMARY KEY, embedding real[])"

h="{0,1,2}"
hh= "INSERT INTO documents(id, embedding) VALUES (1,'%s'), (2,'{1,2,3}'),  (3,'{1,1,1}')"%h
print(hh)

def insert_document(id, embedding):
    """Insert one row; `embedding` is a Postgres array literal such as '{0,1,2}'."""
    #review_embedding=get_embedding(text)
    ### INSERT INTO DB
    DB_NAME = "postgres"
    DB_USER = "postgres"
    DB_PASS = "postgres"
    DB_HOST = "localhost"
    DB_PORT = "5432"
    conn = ps.connect(database=DB_NAME,
                      user=DB_USER,
                      password=DB_PASS,
                      host=DB_HOST,
                      port=DB_PORT)

    cur = conn.cursor() # creating a cursor

    # Pass the values as parameters so psycopg2 quotes them,
    # instead of formatting them into the SQL string
    cur.execute(
        "INSERT INTO documents (id, embedding) VALUES (%s, %s)",
        (id, embedding))

    conn.commit()
    print("INSERT EMBEDDING %s successfully" % embedding)
    # Close the cursor before the connection that owns it
    cur.close()
    conn.close()


insert_document(1,'{0,1,2}')
insert_document(2,"{1,2,3}")
insert_document(3,"{1,1,1}")

!sudo -u postgres psql -c "CREATE INDEX ON documents USING hnsw(embedding) WITH (dims=3, m=8, efconstruction=16, efsearch=16)"
#!sudo -u postgres psql -c "CREATE INDEX ON documents USING hnsw(embedding) WITH (dims=3, m=3, efconstruction=5, efsearch=5)"

ARRAY = [3, 3, 3]

def select_document(operator):
    """Print the id of the row nearest to ARRAY under the given distance operator."""
    # The operator is spliced into the SQL text, so only known operators are allowed
    if operator not in ('<->', '<=>', '<~>'):
        raise ValueError("unsupported distance operator: %s" % operator)

    DB_NAME = "postgres"
    DB_USER = "postgres"
    DB_PASS = "postgres"
    DB_HOST = "localhost"
    DB_PORT = "5432"
    conn = ps.connect(database=DB_NAME,
                      user=DB_USER,
                      password=DB_PASS,
                      host=DB_HOST,
                      port=DB_PORT)

    cur = conn.cursor() # creating a cursor

    # SET only lasts for the current session, so it is issued on this
    # connection; a separate psql call would not affect the query below
    cur.execute("SET enable_seqscan = off")

    # psycopg2 adapts the Python list to ARRAY[3,3,3]
    cur.execute(
        "SELECT id FROM documents ORDER BY embedding " + operator + " %s LIMIT 1",
        (ARRAY,))

    print(cur.fetchone())
    cur.close()
    conn.close()

# The operator selects the distance metric between the query vector and each row:
# <-> Euclidean (L2), <=> cosine, <~> Manhattan.
select_document('<->')
select_document('<=>')
select_document('<~>')
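
To see what the three operators compare, the same three rows and the query vector `[3, 3, 3]` can be checked in plain Python. This is a minimal sketch, assuming the operator-to-metric mapping described in the pg_embedding documentation (`<->` Euclidean, `<=>` cosine, `<~>` Manhattan); it does an exact scan rather than an HNSW search, and the helper names are hypothetical.

```python
import math


def l2(a, b):
    """Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Same rows as the documents table above, keyed by id
rows = {1: [0, 1, 2], 2: [1, 2, 3], 3: [1, 1, 1]}
query = [3, 3, 3]

METRICS = {"<->": l2, "<=>": cosine_distance, "<~>": manhattan}


def nearest(op):
    """Return the id of the row closest to the query under the given operator."""
    metric = METRICS[op]
    return min(rows, key=lambda doc_id: metric(rows[doc_id], query))


for op in METRICS:
    print(op, nearest(op))
```

Euclidean and Manhattan distance both pick row 2, while cosine distance picks row 3, because `[1, 1, 1]` points in exactly the same direction as `[3, 3, 3]`.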

# DATA LOADER - PDF FILES

AMAZON Shareholder Letters

In [29]:
!mkdir -p /content/data

from urllib.request import urlretrieve
urls = [
    'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/2021-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2021/ar/Amazon-2020-Shareholder-Letter-and-1997-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2020/ar/2019-Shareholder-Letter.pdf'
]

filenames = [
    'AMZN-2022-Shareholder-Letter.pdf',
    'AMZN-2021-Shareholder-Letter.pdf',
    'AMZN-2020-Shareholder-Letter.pdf',
    'AMZN-2019-Shareholder-Letter.pdf'
]

metadata = [
    dict(year=2022, source=filenames[0]),
    dict(year=2021, source=filenames[1]),
    dict(year=2020, source=filenames[2]),
    dict(year=2019, source=filenames[3])]

data_root = "/content/data/"

for idx, url in enumerate(urls):
    file_path = data_root + filenames[idx]
    urlretrieve(url, file_path)

In [30]:
from pypdf import PdfReader, PdfWriter
import glob

local_pdfs = glob.glob(data_root + '*.pdf')

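# Drop the last 3 pages of each letter and overwrite the file in place.
# Re-running this cell removes another 3 pages each time.
for local_pdf in local_pdfs:
    pdf_reader = PdfReader(local_pdf)
    pdf_writer = PdfWriter()
    for pagenum in range(len(pdf_reader.pages)-3):
        page = pdf_reader.pages[pagenum]
        pdf_writer.add_page(page)

    # Mode 'wb' already truncates the file, so no seek/truncate is needed
    with open(local_pdf, 'wb') as new_file:
        pdf_writer.write(new_file)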

In [32]:
import numpy as np
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader, PyPDFDirectoryLoader

documents = []

for idx, file in enumerate(filenames):
    loader = PyPDFLoader(data_root + file)
    document = loader.load()
    for document_fragment in document:
        document_fragment.metadata = metadata[idx]

    documents += document

# - in our testing Character split works better with this PDF data set
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size = 512,
    chunk_overlap  = 100,
)

#text_splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=100)
### for the DB embedding
docs = text_splitter.split_documents(documents)


print(f'# of Document Pages {len(documents)}')
print(f'# of Document Chunks: {len(docs)}')

collection_name = "AWS"

# of Document Pages 25
# of Document Chunks: 299
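
The effect of `chunk_size=512` with `chunk_overlap=100` can be pictured as a window that advances 412 characters at a time, so neighbouring chunks share 100 characters. The sketch below is a simplified fixed-window illustration, not how `RecursiveCharacterTextSplitter` works internally (it prefers to break on separators); `sliding_chunks` and the synthetic text are hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows whose starts are chunk_size - chunk_overlap apart."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    stride = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# 1200 synthetic characters stand in for a page of text
text = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(text, chunk_size=512, chunk_overlap=100)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means a sentence cut at a chunk boundary still appears whole in one of the two neighbouring chunks, at the cost of storing and embedding some text twice.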


# OPENAI - SETTINGS

In [33]:
import warnings
warnings.filterwarnings('ignore')

import colab_env
import openai
import os
openai.api_key = os.getenv("OPENAI_API_KEY")

from openai import OpenAI
client = OpenAI()

In [34]:
def gpt_reponse(query):
  response = client.chat.completions.create(
    model="gpt-4",
    #model="gpt-3.5-turbo"
    #response_format={ "type": "json_object" },
    messages=[
      #{"role": "system", "content": "You are a helpful assistant designed to output JSON."},
      {"role": "system", "content": "You are a helpful assistant designed to output text."},
      {"role": "user", "content": query}
    ]
  )

  return response

In [35]:
query = "Who won the world series in 2009 and who lost, explained?, who were the managers?"
response=gpt_reponse(query)

print()
print("-" * 80)
print('Question: %s'%query)
print("-" * 80)
print('Answer: %s'%response.choices[0].message.content)
print("-" * 80)


--------------------------------------------------------------------------------
Question: Who won the world series in 2009 and who lost, explained?, who were the managers?
--------------------------------------------------------------------------------
Answer: The 2009 World Series was won by the New York Yankees, defeating the Philadelphia Phillies. The Yankees secured their 27th championship title, winning the series 4 games to 2.

The manager of the New York Yankees in 2009 was Joe Girardi. Having joined the team in 2008, Girardi led the Yankees to their first World Series win since 2000. His strategy, leadership, and decision-making played a significant role in their success.

On the other side, the Philadelphia Phillies were managed by Charlie Manuel. Despite losing the series, Manuel was at the helm of the Phillies during a successful period in their history. In 2008, he had led the team to a World Series victory.

Both teams showed exceptional quality and resilience, deliverin

# GPT4 MODEL - EXAMPLE

In [36]:
query = "How AWS has evolved?"
response=gpt_reponse(query)
print("-" * 80)
print('Question: %s'%query)
print("-" * 80)
print('Answer: %s'%response.choices[0].message.content)
print("-" * 80)

--------------------------------------------------------------------------------
Question: How AWS has evolved?
--------------------------------------------------------------------------------
Answer: Amazon Web Services (AWS) has undergone significant evolution since its inception in 2006, growing into the most comprehensive, broadly adopted, and mature cloud platform globally. Here's a brief overview of AWS's evolution:

1. Launch Phase (2006-2010): AWS launched with just three services -- Amazon Simple Queue Service (SQS), Amazon Simple Storage Service (S3), and Amazon Elastic Compute Cloud (EC2).

2. Expansion Phase (2010-2013): AWS expanded rapidly, launching new services like Elastic Load Balancing, Auto Scaling, and Amazon Relational Database Service (RDS). Amazon also introduced Elastic Beanstalk, a platform as a service (PaaS), and CloudFormation, an IaaS service.

3. Global Expansion (2013-2016): AWS began expanding globally, opening regions in Asia (Tokyo, Sydney, Beijing) a

In [37]:
query = "I bought an ice cream for 6 kids. Each cone was $1.25 and I paid with a $10 bill. How many dollars did I get back? Explain first before answering."
response=gpt_reponse(query)

print()
print("-" * 80)
print('Question: %s'%query)
print("-" * 80)
print('Answer: %s'%response.choices[0].message.content)
print("-" * 80)


--------------------------------------------------------------------------------
Question: I bought an ice cream for 6 kids. Each cone was $1.25 and I paid with a $10 bill. How many dollars did I get back? Explain first before answering.
--------------------------------------------------------------------------------
Answer: First, to find the total cost of the ice cream cones, you need to multiply the price of each cone, $1.25, by the number of kids, which is 6. So, $1.25 times 6 equals $7.50, which is the total cost of the cones.

Next, you need to calculate how much change you should get back. You do this by taking the amount you paid, $10, and subtracting the total cost of the cones, $7.50.

So, $10 - $7.50 = $2.50  

So you got back $2.50 in change after buying ice cream cones for 6 kids.
--------------------------------------------------------------------------------
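
The arithmetic in the answer can be checked directly. This small sketch uses `Decimal` so the dollar amounts stay exact; the variable names are illustrative.

```python
from decimal import Decimal

cone_price = Decimal("1.25")
cones = 6
paid = Decimal("10.00")

total = cone_price * cones   # cost of six cones
change = paid - total        # money returned from the $10 bill

print(f"total=${total}, change=${change}")
```

This prints `total=$7.50, change=$2.50`, matching the model's answer.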


# EMBEDDING - WITH PDF FILES

In [38]:
# 20x faster than pgvector: introducing pg_embedding extension for vector search in Postgres and LangChain
# https://neon.tech/blog/pg-embedding-extension-for-vector-search

#ADDED By FM 22/02/2024

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import PGEmbedding

# https://supabase.com/blog/fewer-dimensions-are-better-pgvector
embeddings = OpenAIEmbeddings(model='text-embedding-ada-002')

collection_name='AWS'
connection_string = os.getenv("DATABASE_URL")

import llama_index.core.readers as readers
reader = readers.SimpleDirectoryReader(input_files=["/content/pg_essay.txt"])
## for the RAG
#docs = reader.load_data()

### FOR PDF FILES
db = PGEmbedding.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=collection_name,
    connection_string=connection_string,
)

#db.create_hnsw_index(dims = 1536, m = 8, ef_construction = 16, ef_search = 16)
#!sudo -u postgres psql -c "CREATE INDEX ON documents USING hnsw(embedding) WITH (dims=3, m=8, efconstruction=16, efsearch=16)"


In [39]:
#ADDED By FM 22/02/2024

from typing import List, Tuple
from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import PGEmbedding

query='AI'

docs_with_score: List[Tuple[Document, float]] = db.similarity_search_with_score(query)

print()
print(query)
print()

for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)


AI

--------------------------------------------------------------------------------
Score:  0.61912847
drones for Prime Air,to Alexa, to the many machine learning services AWS offers (where AWS has the broadest machine learningfunctionality and customer base of any cloud provider). More recently, a newer form of machine learning,called Generative AI, has burst onto the scene and promises to significantly accelerate machine learningadoption. Generative AI is based on very Large Language Models (trained on up to hundreds of billionsof parameters, and growing), across expansive datasets, and has radically
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score:  0.61912847
drones for Prime Air,to Alexa, to the many machine learning services AWS offers (where AWS has the broadest machine learningfunctionality and customer base of any cloud provider). More recently, a newer form
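
The search output above lists the same chunk twice with an identical score. One possible cause is that the collection was populated more than once, though the notebook does not confirm this. A minimal sketch of filtering repeated hits on the client side follows; `dedupe_hits` and the sample data are hypothetical, and with real results the text would come from `doc.page_content`.

```python
def dedupe_hits(hits):
    """Keep the first (text, score) pair for each distinct text, preserving order."""
    seen = set()
    unique = []
    for text, score in hits:
        if text not in seen:
            seen.add(text)
            unique.append((text, score))
    return unique


# Hypothetical hits: the first two entries are the same chunk returned twice
hits = [("chunk A", 0.62), ("chunk A", 0.62), ("chunk B", 0.70)]
unique_hits = dedupe_hits(hits)
print(unique_hits)
```

With the notebook's results this could be called as `dedupe_hits([(doc.page_content, score) for doc, score in docs_with_score])`.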

# LANG CHAIN

In [40]:
import torch
from textwrap import fill
from IPython.display import Markdown, display

from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
    )

from langchain import PromptTemplate
from langchain import HuggingFacePipeline

from langchain.vectorstores import Chroma
from langchain.schema import AIMessage, HumanMessage
from langchain.memory import ConversationBufferMemory
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredMarkdownLoader, UnstructuredURLLoader
from langchain.chains import LLMChain, SimpleSequentialChain, RetrievalQA, ConversationalRetrievalChain
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline
import warnings
warnings.filterwarnings('ignore')

**Create a chain to answer questions**

In [41]:
openai.__version__

'1.25.0'

In [42]:
from langchain.llms import OpenAI

retriever = db.as_retriever(search_type="similarity", search_kwargs={"k":2})

# create a chain to answer questions
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(model_name='gpt-3.5-turbo-instruct', max_tokens=1512, temperature=0.9),
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)
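
With `chain_type="stuff"`, the retrieved chunks are placed together into a single prompt along with the question. The sketch below illustrates that idea only; `build_stuff_prompt` and its prompt wording are hypothetical and are not LangChain's actual template.

```python
def build_stuff_prompt(question: str, contexts: list[str]) -> str:
    """Concatenate all retrieved chunks into one prompt ahead of the question."""
    context_block = "\n\n".join(contexts)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Two hypothetical retrieved chunks, matching k=2 in the retriever above
prompt = build_stuff_prompt(
    "How AWS has evolved?",
    ["first retrieved chunk", "second retrieved chunk"],
)
print(prompt)
```

Because every retrieved chunk goes into one prompt, the total size of `k` chunks plus the question has to fit in the model's context window.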


**Ask the model directly (no retrieval)**

In [43]:
#MODEL	DESCRIPTION	CONTEXT WINDOW	TRAINING DATA
#gpt-3.5-turbo-0125
#gpt-3.5-turbo-instruct
# https://platform.openai.com/docs/models/overview
# https://pypi.org/project/openai/

from langchain.llms import OpenAI


#llm=OpenAI(model='gpt-4',max_tokens=1512,temperature=0.9)
llm=OpenAI(model='gpt-3.5-turbo-instruct',max_tokens=1512,temperature=0.9)


#from langchain_community.llms import OpenAI as OpenAIv1
#llm = OpenAIv1(model_name="gpt-4")


print()
print("-" * 80)
query = "How AWS has evolved?"
#query = "How many AI publications in 2022?"
result = llm(query)
display(Markdown(f"<p>{query}</p>"))
print("-" * 80)
display(Markdown(f"<p>{result}</p>"))


--------------------------------------------------------------------------------


<p>How AWS has evolved?</p>

--------------------------------------------------------------------------------


<p>

1. Increased Services and Capabilities: When AWS was first launched in 2006, it offered only 3 services. Today, it offers over 175 services and continues to add more. These services range from computing, storage, networking, databases, analytics, AI/ML, IoT, security, and many more.

2. Global Infrastructure: AWS has expanded its infrastructure globally with the addition of multiple regions and availability zones. This allows customers to have their services hosted in different regions for better redundancy and improved performance.

3. Hybrid Cloud Capabilities: AWS has evolved to support hybrid cloud architectures, where customers can connect their on-premises infrastructure to the AWS cloud. This allows organizations to have a mix of both on-premises and cloud environments, giving them more flexibility in managing their workloads.

4. Industry-Specific Solutions: AWS has developed solutions tailored for specific industries such as healthcare, finance, government, education, and more. These solutions help organizations in these industries comply with regulations and address their specific needs.

5. AI and ML Capabilities: AWS has heavily invested in AI and ML capabilities, making it easier for customers to build and deploy machine learning models. This has opened up new opportunities for businesses to use AI and ML in their applications and services.

6. Serverless Computing: AWS has introduced serverless computing through its services such as AWS Lambda, which allows customers to run code without managing servers. This has greatly reduced the complexity and cost of managing servers for organizations.

7. DevOps Integration: AWS has integrated its services with DevOps tools, allowing customers to automate their software delivery processes. This has helped organizations to accelerate their software development and deployment cycles.

8. Focus on Security: AWS has put a strong emphasis on security and compliance, offering a wide range of security services and compliance programs to help customers secure their applications and data on the cloud.

9. Partner Ecosystem: AWS has built a strong partner ecosystem, with thousands of technology and consulting partners offering solutions and services on top of AWS. This has helped customers to easily find and implement the solutions they need for their business.

10. User-Friendly Interfaces: AWS has evolved its user interface to be more user-friendly and intuitive, making it easier for customers to manage and monitor their services on the cloud.</p>

**Ask the retrieval chain**

In [44]:
print()
#print('chain to answer questions')
print("-" * 80)
print()
result = qa({"query": query})
print(f'Query: {result["query"]}\n')
print(f'Result: {result["result"]}\n')
print(f'Context Documents: ')
for srcdoc in result["source_documents"]:
      print(f'{srcdoc}\n')
print("-" * 80)


--------------------------------------------------------------------------------

Query: How AWS has evolved?

Result:  AWS has evolved by offering customers more functionality and becoming a game-changing platform. 

Context Documents: 
page_content='customersmuch more functionality in AWS than they can find anywhere else (which is a significant differentiator), butalso allowed us to arrive at the much more game-changing offering that AWS is today.' metadata={'year': 2021, 'source': 'AMZN-2021-Shareholder-Letter.pdf'}

page_content='customersmuch more functionality in AWS than they can find anywhere else (which is a significant differentiator), butalso allowed us to arrive at the much more game-changing offering that AWS is today.' metadata={'year': 2021, 'source': 'AMZN-2021-Shareholder-Letter.pdf'}

--------------------------------------------------------------------------------
