# Prepare environment

## Configure Jupyter

* Install libs

In [1]:
%%capture
# Prepare graphing capabilities
!pip install plotly matplotlib
# Enable SQL magics (%sql and %%sql)
!pip install ipython-sql
# Enable easy UI
!pip install gradio

# Create embeddings

* Install libs
* Read env from https://github.com/frtu/jupyter-workbench/blob/master/docker/jupyter-llm/docker-compose.yml#L22

In [2]:
%%capture

!pip install --upgrade tiktoken
!pip install --upgrade langchain

In [3]:
!pip show langchain

Name: langchain
Version: 0.0.202
Summary: Building applications with LLMs through composability
Home-page: https://www.github.com/hwchase17/langchain
Author: 
Author-email: 
License: MIT
Location: /opt/conda/lib/python3.10/site-packages
Requires: aiohttp, async-timeout, dataclasses-json, langchainplus-sdk, numexpr, numpy, openapi-schema-pydantic, pydantic, PyYAML, requests, SQLAlchemy, tenacity
Required-by: 


## Prepare data processing

In [4]:
def normalize_text(text):
    return " ".join(text.split())

normalize_text("""
Apple is a corporate structure
 that
 
 is famous
""")

'Apple is a corporate structure that is famous'

In [None]:
import tiktoken

# embedding model parameters
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002
max_tokens = 8000  # the maximum for text-embedding-ada-002 is 8191

encoding = tiktoken.get_encoding(embedding_encoding)

def get_token(text):
    filtered_text = normalize_text(text)
    return encoding.encode(filtered_text)

token = get_token("""
Apple is a corporate structure
 that
 
 is famous
""")

len(token)
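
The cell above defines `max_tokens` but never applies it. A minimal sketch of one way to enforce the limit is shown below: encode, cut the token list, decode. The helper `truncate_to_max_tokens` and the toy word-level tokenizer are hypothetical and exist only to keep the example free of dependencies; they are not part of tiktoken.

```python
from typing import Callable, List


def truncate_to_max_tokens(
    text: str,
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    max_tokens: int = 8000,
) -> str:
    """Return text unchanged if it fits, otherwise keep only the first max_tokens tokens."""
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text
    return decode(tokens[:max_tokens])


# Toy stand-in tokenizer (assumption for illustration): one token per word.
vocab: List[str] = []


def toy_encode(text: str) -> List[int]:
    ids = []
    for word in text.split():
        if word not in vocab:
            vocab.append(word)
        ids.append(vocab.index(word))
    return ids


def toy_decode(ids: List[int]) -> str:
    return " ".join(vocab[i] for i in ids)


short = truncate_to_max_tokens(
    "Apple is a corporate structure that is famous",
    toy_encode,
    toy_decode,
    max_tokens=5,
)
print(short)
```

With the real tokenizer, `encoding.encode` and `encoding.decode` from the cell above could be passed in place of the toy functions.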

## Init openai & data processing

In [22]:
%%capture
# https://platform.openai.com/docs/libraries
!pip install --upgrade openai

In [23]:
import os  # for reading environment variables

import openai  # for calling the OpenAI API

# Load your API key from an environment variable or secret management service
openai.api_key = os.getenv("OPENAI_API_KEY")

In [25]:
from langchain.embeddings.openai import OpenAIEmbeddings

texts = [
    'this is the first chunk of text',
    'then another second chunk of text is here'
]

embed = OpenAIEmbeddings(
    openai_api_key=openai.api_key
)
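# Mirrors the signature of the openai.Embedding helpers; `model` is unused here
# because the model is configured on the `embed` object.
def get_embedding(text, model="text-embedding-ada-002"):
    filtered_text = normalize_text(text)
    # embed_documents expects a list of texts and returns one vector per text
    return embed.embed_documents([filtered_text])[0]

# Same signature as above, but uses the query-embedding entry point
def get_embed_query(text, model="text-embedding-ada-002"):
    filtered_text = normalize_text(text)
    return embed.embed_query(filtered_text)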

# embeddings = get_embed_query('this is the first chunk of text', model='text-embedding-ada-002')
# len(embeddings)

# Document manipulation

## Loading a text file into PGVector

In [13]:
%%capture
!pip install pgvector  # Python client for the PostgreSQL pgvector extension

In [36]:
import os

from langchain.vectorstores.pgvector import PGVector, DistanceStrategy

# Build the database URL from environment variables, with local defaults
connection_string = PGVector.connection_string_from_db_params(
    driver=os.getenv('DRIVER', 'psycopg2'),
    host=os.getenv('DB_HOST', 'database'),
    port=os.getenv('DB_PORT', '5432'),
    database=os.getenv('DB_DATABASE', 'db'),
    user=os.getenv('DB_USER', 'admin'),
    password=os.getenv('DB_PASSWORD', 'admin')
)
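
For reference, the resulting value is expected to be a SQLAlchemy-style URL. The sketch below rebuilds one by hand under that assumption. `build_connection_string` is a hypothetical helper, not a LangChain function, and the URL-escaping of credentials is an addition for illustration.

```python
from urllib.parse import quote_plus


def build_connection_string(
    driver: str, host: str, port: str, database: str, user: str, password: str
) -> str:
    """Assemble postgresql+<driver>://<user>:<password>@<host>:<port>/<database>."""
    # Escape credentials so characters such as '@' or '/' do not break the URL.
    return (
        f"postgresql+{driver}://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


example_url = build_connection_string(
    driver="psycopg2",
    host="database",
    port="5432",
    database="db",
    user="admin",
    password="admin",
)
print(example_url)
```

Printing `connection_string` from the cell above is a quick way to check which host and database the notebook will actually use.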

In [37]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

loader = TextLoader('/data/data/state_of_the_union.txt')
documents = loader.load()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
print(len(documents))
print(len(docs))

1
45
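
The output shows one loaded document turned into 45 chunks. The sketch below illustrates the general idea of separator-based chunking: split on blank lines, then greedily merge pieces up to `chunk_size` characters. It is a simplified illustration with no overlap handling, not LangChain's implementation, and `split_text` is a hypothetical name.

```python
from typing import List


def split_text(text: str, chunk_size: int, separator: str = "\n\n") -> List[str]:
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size chars."""
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        if not piece.strip():
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            # Start a new chunk, or extend the current one while it still fits.
            # A single piece longer than chunk_size is kept whole.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccccccc\n\ndd"
chunks = split_text(sample, chunk_size=10)
print(len(chunks))
print(chunks)
```

In this sketch a smaller `chunk_size` yields more, shorter chunks, which in the notebook would mean more rows to embed and store.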


In [38]:
from typing import List, Tuple
from langchain.embeddings.openai import OpenAIEmbeddings

collection_name = 'state_of_the_union'

# The PGVector Module will try to create a table with the name of the collection. 
# So, make sure that the collection name is unique and the user has the permission to create a table.
db = PGVector.from_documents(
    embedding=embed,
    documents=docs,
    collection_name=collection_name,
    connection_string=connection_string,
)

In [41]:
from langchain.docstore.document import Document

query = "What did the president say about federal deficit?"
docs_with_score: List[Tuple[Document, float]] = db.similarity_search_with_score(query)

In [40]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print(doc.metadata)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.5294667538679643
In the last two years, my administration cut the deficit by more than $1.7 trillion – the largest deficit reduction in American history.

Under the previous administration, America’s deficit went up four years in a row.

Because of those record deficits, no president added more to the national debt in any four years than my predecessor.

Nearly 25% of the entire national debt, a debt that took 200 years to accumulate, was added by that administration alone.

How did Congress respond to all that debt?

They lifted the debt ceiling three times without preconditions or crisis.

They paid America’s bills to prevent economic disaster for our country.

Tonight, I’m asking this Congress to follow suit.

Let us commit here tonight that the full faith and credit of the United States of America will never, ever be questioned.

Some of my Republican friends want to take the economy hostage 

In [42]:
# Reconnect to the existing collection, reusing the `embed` object defined earlier
store = PGVector(
    connection_string=connection_string,
    embedding_function=embed,
    collection_name='state_of_the_union',
    distance_strategy=DistanceStrategy.COSINE
)

retriever = store.as_retriever(search_kwargs={"k": 1})

In [None]:
retriever.get_relevant_documents('What did the president say about federal deficit?')
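
To make the `DistanceStrategy.COSINE` and `k=1` settings above more concrete, here is a minimal pure-Python sketch of cosine-distance ranking. It assumes the common definition of cosine distance as one minus cosine similarity, where smaller means closer. The functions and the two-dimensional vectors are made up for illustration; real embeddings have far more dimensions and the database does this ranking itself.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    stored: List[Tuple[str, Sequence[float]]],
    k: int = 1,
) -> List[Tuple[str, float]]:
    """Rank stored (text, vector) pairs by distance to the query and keep the k closest."""
    scored = [(text, cosine_distance(query_vec, vec)) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy data (assumption): two stored chunks with hand-picked vectors.
stored = [
    ("deficit paragraph", [1.0, 0.0]),
    ("unrelated paragraph", [0.0, 1.0]),
]
best = top_k([0.9, 0.1], stored, k=1)
print(best)
```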