<a href="https://colab.research.google.com/github/acdc-digital/acdc.cooksite/blob/master/colab_files/solomon_chat_v3_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sv3 is an early proof-of-concept.
***

**solomon-v3:**

Introduces a promising workaround for embedding rate limits: a throttled parallel processing strategy.

Solomonv2 can ingest most combinations of document volumes and adequately prepare the CSV. Where it failed was in generating the new embeddings for relevant search and integrating those embeddings back into the original DataFrame for further analysis.

Early testing for v3 will ingest large volumes of PDFs to measure how smoothly it generates embeddings without any of the typical warning signs.

It's likely solo-v2 will serve as the initial site model for simple file uploads and Q&A. We're still investigating some advanced features, but the application generally works well enough, and on a sufficiently large server it should handle concurrent use, given enough compute power.

Testing uses the openai@gmail API key (v3-testing).


**initial benchmarking:**

ingestion pages: 1801 / ingestion time: 4 minutes

embeddings: 45 seconds / convert back to df: 45 seconds
***

# Code

i. Each iteration of the loop reads a PDF file, extracts the necessary information, and appends it as a dictionary to the pdf_data list. After the loop, this list of dictionaries is converted into a DataFrame, where each dictionary corresponds to a row and the keys correspond to the column names.

In [None]:
! pip install "deeplake[enterprise]"
! pip install langchain
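! pip install "openai<1"  # the code below uses openai.Embedding, which was removed in openai 1.0
! pip install PyPDF2
! pip install tenacity  # provides the @retry decorator imported below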
! pip install tiktoken

In [None]:
## CHUNKING LOGIC

import PyPDF2
import os
import pandas as pd

# Initialize an empty list to hold the PDF data
pdf_data = []

# Loop through each PDF file in the directory
for filename in os.listdir('/content/source_docs'):
    if filename.endswith('.pdf'):
        pdf_file_path = os.path.join('/content/source_docs', filename)

        # Get file size
        file_size = os.path.getsize(pdf_file_path)

        # Read the PDF file
        with open(pdf_file_path, 'rb') as f:
            pdf_reader = PyPDF2.PdfReader(f)
            text_content = ''

            # Loop through each page and extract text
            for page in pdf_reader.pages:
                # Guard against pages that yield no text
                text_content += page.extract_text() or ''

        # Placeholder summary: the first 100 characters of the extracted text.
        # Note: chunk_text() below splits this summary, not the full text, so each
        # file currently yields at most one chunk of up to 100 characters.
        summary = text_content[:100]

        # Number of pages
        number_of_pages = len(pdf_reader.pages)

        # Append to the list
        pdf_data.append({
            'filename': filename,
            'title_or_heading': filename,  # Placeholder, replace as needed
            'content_summary': summary,
            'file_size': file_size,
            'number_of_pages': number_of_pages
        })

# Convert the list to a DataFrame
article_df = pd.DataFrame(pdf_data)

# To display the column names
print(article_df.columns)

# Chunking Logic

def chunk_text(row, text_list):
    """Split a row's content_summary into 500-character chunks and append them to text_list."""
    title = row['filename']
    file_body_string = row['content_summary']
    chunk_size = 500

    # Simple fixed-width string slicing; swap in smarter chunking as needed
    for i in range(0, len(file_body_string), chunk_size):
        chunk = file_body_string[i:i + chunk_size]
        chunk_index = i // chunk_size
        text_list.append({
            'id': f"{title}-chunk-{chunk_index}",  # avoids shadowing the built-in id()
            'metadata': {
                "filename": title,
                "content": chunk,
                "file_chunk_index": chunk_index
            }
        })

# Initialize an empty list to hold the chunked text data
chunked_text_data = []

# Apply the chunk_text function to each row of the DataFrame
article_df.apply(lambda row: chunk_text(row, chunked_text_data), axis=1)

# Convert the list to a DataFrame (if needed)
chunked_text_df = pd.DataFrame(chunked_text_data)

# Display the chunked text DataFrame
print(chunked_text_df)

Index(['filename', 'title_or_heading', 'content_summary', 'file_size',
       'number_of_pages'],
      dtype='object')
                                                   id  \
0   Hands-On Machine Learning with Scikit-Learn an...   
1           The Long-Document Transformer.pdf-chunk-0   
2                              HYDE-embed.pdf-chunk-0   
3                   Open-Financial-Models.pdf-chunk-0   
4                        Sherlock-No-Dice.pdf-chunk-0   
5                  AI-Evolution-Education.pdf-chunk-0   
6                      public-AI-Concerns.pdf-chunk-0   
7                       HighPerformanceAI.pdf-chunk-0   
8                         Mixture_Experts.pdf-chunk-0   
9                Contrast-Context-Scaling.pdf-chunk-0   
10                        topic_modelling.pdf-chunk-0   
11                  WebEnhanced-Retrieval.pdf-chunk-0   
12                  Modelling-Bias-Tuning.pdf-chunk-0   
13                        Prompt-You-Need.pdf-chunk-0   
14  What_Is_ChatGPT_Doing

In [None]:
import os
import openai

# We'll use 1000-token chunks, with some intelligence to avoid splitting mid-sentence
TEXT_EMBEDDING_CHUNK_SIZE = 1000
EMBEDDINGS_MODEL = "text-embedding-ada-002"

# Read the key from the environment; never hard-code or print it
openai.api_key = os.environ["OPENAI_API_KEY"]
print("Embeddings Model:", EMBEDDINGS_MODEL)

Embeddings Model: text-embedding-ada-002
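
The "don't split mid-sentence" intent behind `TEXT_EMBEDDING_CHUNK_SIZE` could look roughly like the sketch below, which packs whole sentences into chunks under a token budget. It is an illustration only: `chunk_by_sentence` is a hypothetical helper, the sentence splitter is a naive regex, and tokens are approximated by whitespace-separated words so the example has no dependencies. Passing `count_tokens=lambda s: len(tiktoken.get_encoding("cl100k_base").encode(s))` would count real tokens instead.

```python
import re


def chunk_by_sentence(text, max_tokens=1000, count_tokens=lambda s: len(s.split())):
    """Pack whole sentences into chunks of at most max_tokens (as measured by count_tokens)."""
    # Naive split on ., ! or ? followed by whitespace (an assumption, not a real sentence parser)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Close the current chunk if this sentence would push it over the budget
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n

    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five. Six seven eight nine. Ten."
chunks = chunk_by_sentence(sample, max_tokens=5)
print(chunks)
```

A single sentence longer than the budget still becomes its own oversized chunk here; a fuller version would need a fallback split for that case.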


In [None]:
from tenacity import retry, wait_random_exponential, stop_after_attempt
from typing import List
import openai  # Make sure to install the OpenAI Python package
import concurrent.futures
from tqdm import tqdm
import tiktoken  # Make sure to install the tiktoken Python package
import os

# Take a batch of token lists and return one embedding per item, retrying with
# randomized exponential backoff when the API call fails (e.g. on rate limits)
@retry(wait=wait_random_exponential(min=1, max=40), stop=stop_after_attempt(10))
def get_embeddings(batch: List[List[int]]):
    print("Getting embeddings for input:", batch[:3])  # Print first 3 elements for debugging
    response = openai.Embedding.create(
        input=batch,
        model=EMBEDDINGS_MODEL,
    )["data"]
    embeddings = [data["embedding"] for data in response]
    print("Generated embeddings:", embeddings[:3])  # Print first 3 embeddings for debugging
    return embeddings

def batchify(iterable, n=1):
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx : min(ndx + n, l)]

# Function for batching and parallel processing the embeddings
def embed_corpus(
    text_list,  # List containing chunked text data
    batch_size=64,
    num_workers=8,
    max_context_len=8191,
):
    print("embed_corpus function called")

    # Extract the 'content' from the 'metadata' dictionary and convert it to a list
    corpus = [text['metadata']['content'] for text in text_list]

    # Encode the corpus, truncating to max_context_len
    encoding = tiktoken.get_encoding("cl100k_base")
    encoded_corpus = [
        encoded_article[:max_context_len] for encoded_article in encoding.encode_batch(corpus)
    ]

    # Calculate corpus statistics: the number of inputs, the total number of tokens, and the estimated cost to embed
    num_tokens = sum(len(article) for article in encoded_corpus)
    cost_to_embed_tokens = num_tokens / 1_000 * 0.0004
    print(
        f"num_articles={len(encoded_corpus)}, num_tokens={num_tokens}, est_embedding_cost={cost_to_embed_tokens:.2f} USD"
    )

    # Embed the corpus

    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:

        # Map each future to its batch size so the progress bar advances by the true count
        futures = {
            executor.submit(get_embeddings, text_batch): len(text_batch)
            for text_batch in batchify(encoded_corpus, batch_size)
        }

        with tqdm(total=len(encoded_corpus)) as pbar:
            for future in concurrent.futures.as_completed(futures):
                pbar.update(futures[future])

        embeddings = []
        for future in futures:
            data = future.result()
            embeddings.extend(data)
            print("Current embeddings length:", len(embeddings))  # Print the length of embeddings list

        print("Final embeddings:", embeddings[:3])  # Print first 3 embeddings for debugging
        return embeddings

# text_list is built by the tokenization cell below; run that cell first
embeddings = embed_corpus(text_list)

# Print the embeddings and DataFrame for debugging
print(embeddings)
print(chunked_text_df.head())

embed_corpus function called
num_articles=22, num_tokens=634, est_embedding_cost=0.00 USD
Getting embeddings for input: [[32, 324, 73511, 268, 480, 978, 2298, 96539, 67454, 2355, 22333, 21579, 2355, 4291, 2522, 61503, 8288, 10326, 2355, 5, 96086, 2355, 5910, 21752, 50, 11, 5257, 41363, 11, 3651, 220], [6720, 35627, 25, 578, 5843, 12, 7676, 63479, 198, 40, 89, 33993, 82770, 191, 50988, 469, 13, 32284, 191, 7098, 1543, 356, 57572, 191, 198, 80977, 10181, 220], [69933, 1082, 18811, 31361, 354, 43622, 20035, 838, 2085, 1050, 33194, 62096, 198, 43, 4168, 84, 480, 3524, 191, 88, 55, 361, 8890, 526, 11583, 191, 89, 86755, 8732, 89, 97750, 14751]]


64it [00:00, 121.93it/s]

Generated embeddings: [[-0.00854767207056284, -0.0090925432741642, 0.02863299660384655, -0.015760408714413643, 0.014983966015279293, 0.018198708072304726, 0.0023225147742778063, 0.0028520617634058, -0.024941492825746536, -0.017013613134622574, 0.03250158578157425, 0.016877394169569016, -0.015024831518530846, -0.013076916337013245, 0.010468343272805214, -0.0058914232067763805, 0.025908639654517174, 0.012654640711843967, 0.009262815117835999, -0.013492380268871784, -0.01937018148601055, 0.006197913084179163, 0.0073489542119205, -0.02277562953531742, 0.0037732350174337626, 0.0027192493434995413, 0.00937179010361433, -0.024818897247314453, 0.001569911022670567, -0.011122189462184906, 0.020241975784301758, -0.002799277426674962, 0.011844144202768803, -0.013056483119726181, -0.034435879439115524, -0.0005389119614847004, 0.019615374505519867, -0.0010599453235045075, -0.004689300432801247, -0.0033577706199139357, 0.024791652336716652, 0.027584118768572807, 0.004559893161058426, -0.027624985203




In [None]:
%%time
# Initialise tokenizer
tokenizer = tiktoken.get_encoding("cl100k_base")

# List to hold vectors
text_list = []

# Debugging print statement
print("Starting tokenization...")

# Chunk each PDF row and collect the chunks in text_list, ready for embedding
article_df.apply(lambda row: chunk_text(row, text_list), axis=1)

# Debugging print statement
print("Tokenization completed.")
print("Number of articles processed:", len(text_list))

Starting tokenization...
Tokenization completed.
Number of articles processed: 22
CPU times: user 1.42 ms, sys: 0 ns, total: 1.42 ms
Wall time: 1.39 ms


In [None]:
text_list[0]

{'id': 'Hands-On Machine Learning with Scikit-Learn and TensorFlow_ Concepts, Tools, and Techniques to Build Intelligent Systems - PDF Room.pdf-chunk-0',
 'metadata': {'filename': 'Hands-On Machine Learning with Scikit-Learn and TensorFlow_ Concepts, Tools, and Techniques to Build Intelligent Systems - PDF Room.pdf',
  'content': 'Aurélien GéronHands-On  \nMachine Learning  \nwith Scikit-Learn  \n& TensorFlow  \nCONCEPTS, TOOLS, AND ',
  'file_chunk_index': 0}}

In [None]:
# Join up embeddings with our original list (embed_corpus returns them in input order)
assert len(embeddings) == len(text_list), "embedding count must match chunk count"
for item, vector in zip(text_list, embeddings):
    item["embedding"] = vector
text_list[0]

In [None]:
! pip install --upgrade urllib3 pyopenssl

In [None]:
import os
import pandas as pd
from deeplake.core.vectorstore import VectorStore

# ACTIVELOOP_TOKEN must already be set in the environment; never hard-code it in the notebook
assert 'ACTIVELOOP_TOKEN' in os.environ, "Set ACTIVELOOP_TOKEN before running this cell"

# Assuming text_list is your list of dictionaries containing 'metadata' and 'embedding'
chunked_text = [text['metadata']['content'] for text in text_list]
source_texts = [text['metadata']['filename'] for text in text_list]
precomputed_embeddings = [text['embedding'] for text in text_list]

# Initialize Vector Store with the Hub URL
vector_store_path = "hub://solomon/solov3-space-engine"
vector_store = VectorStore(
    path=vector_store_path,
)

# Add data to Vector Store
vector_store.add(
    text=chunked_text,
    embedding=precomputed_embeddings,
    metadata=[{"source": source_text} for source_text in source_texts]
)

Your Deep Lake dataset has been successfully created!


100%|██████████| 22/22 [00:00<00:00, 29.85it/s]

Dataset(path='hub://solomon/solov3-space-engine', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype      shape      dtype  compression
  -------    -------    -------    -------  ------- 
   text       text      (22, 1)      str     None   
 metadata     json      (22, 1)      str     None   
 embedding  embedding  (22, 1536)  float32   None   
    id        text      (22, 1)      str     None   


 

In [None]:
import deeplake
print(deeplake.__version__)

3.6.26


In [None]:
from deeplake.core.vectorstore import VectorStore
import openai
import os

# Read the OpenAI key from the environment instead of hard-coding it in the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]

vector_store_path = "hub://solomon/solov3-space-engine"

vector_store = VectorStore(
    path = vector_store_path,
    read_only = True
)

Deep Lake Dataset in hub://solomon/solov3-space-engine already exists, loading from the storage


In [None]:
def embedding_function(texts, model="text-embedding-ada-002"):
    # Accept a single string or a list of strings
    if isinstance(texts, str):
        texts = [texts]

    # Replace newlines with spaces before embedding
    texts = [t.replace("\n", " ") for t in texts]
    response = openai.Embedding.create(input=texts, model=model)
    return [data['embedding'] for data in response['data']]

In [None]:
prompt = "What are the first programs he tried writing?"

search_results = vector_store.search(embedding_data=prompt,
                                     embedding_function=embedding_function)

print(search_results)
search_results['text'][0]

{'id': ['aadc9154-5746-11ee-9eeb-0242ac1c000c', 'aadc8a92-5746-11ee-9eeb-0242ac1c000c', 'aadc8b96-5746-11ee-9eeb-0242ac1c000c', 'aadc93fc-5746-11ee-9eeb-0242ac1c000c'], 'metadata': [{'source': 'AttentionisallyouNeed.pdf'}, {'source': 'Hands-On Machine Learning with Scikit-Learn and TensorFlow_ Concepts, Tools, and Techniques to Build Intelligent Systems - PDF Room.pdf'}, {'source': 'The Long-Document Transformer.pdf'}, {'source': 'dea_Makers__Personal_Perspectives_on_the_-_Stephen_Wolfram.pdf'}], 'text': ['Attention Is All You Need\nAshish Vaswani\x03\nGoogle Brain\navaswani@google.comNoam Shazeer\x03\nGoogle Brain', 'Aurélien GéronHands-On  \nMachine Learning  \nwith Scikit-Learn  \n& TensorFlow  \nCONCEPTS, TOOLS, AND ', 'Longformer: The Long-Document Transformer\nIz Beltagy\x03Matthew E. Peters\x03Arman Cohan\x03\nAllen Institute ', 'Idea Makers: Personal Perspectives on the Lives & Ideas of Some Notable People\nCopyright © 2016 Step'], 'score': [array(0.7519803, dtype=float32), arr

'Attention Is All You Need\nAshish Vaswani\x03\nGoogle Brain\navaswani@google.comNoam Shazeer\x03\nGoogle Brain'
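
Conceptually, a search like the one above embeds the prompt and ranks the stored chunks by how close their embeddings are to it. The sketch below shows that ranking step in isolation, assuming cosine similarity as the metric. It is not Deep Lake's implementation: `cosine_similarity`, `top_k`, and the three-dimensional toy vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=4):
    """Return the k (score, text) pairs with the highest cosine similarity to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy store of (text, embedding) pairs; real embeddings here have 1536 dimensions
toy_store = [
    ("chunk about transformers", [1.0, 0.0, 0.0]),
    ("chunk about finance", [0.0, 1.0, 0.0]),
    ("chunk about education", [0.6, 0.8, 0.0]),
]

ranked = top_k([1.0, 0.1, 0.0], toy_store, k=2)
for score, text in ranked:
    print(f"{score:.3f}  {text}")
```

Because every file contributes only its first 100 characters in this run, the ranking mostly reflects title-page text, which is consistent with the prompt above returning a paper header as the top hit.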