<a href="https://colab.research.google.com/github/acdc-digital/acdc.cooksite/blob/master/colab_files/solomon_chat.v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Last Updated: September 6, 2023**

Update, September 9, 2023: text changed only to add the Reddit link.

This is a sample notebook for developing NLP pipelines with LangChain.
Details are in this Reddit thread: https://www.reddit.com/r/LangChain/comments/16cm7g1/rate_my_pipe_colab_notebook_incl/

In [None]:
# datetime and multiprocessing are part of the Python standard library and need no install.
! pip install -U deeplake
! pip install "deeplake[enterprise]"
! pip install langchain
! pip install "layoutparser[layoutmodels,tesseract]"
! pip install nltk
! pip install openai
! pip install pydantic
! pip install requests
! pip install sentence_transformers
! pip install tiktoken
! pip install transformers
! pip install tqdm
! pip install unstructured

In [None]:
! pip show openai

Name: openai
Version: 0.28.0
Summary: Python client library for the OpenAI API
Home-page: https://github.com/openai/openai-python
Author: OpenAI
Author-email: support@openai.com
License: 
Location: /usr/local/lib/python3.10/dist-packages
Requires: aiohttp, requests, tqdm
Required-by: 
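
The `pip show` output above records the client version this notebook was run against (openai 0.28.0). A minimal sketch, assuming you want to confirm at runtime which versions are installed before running the later cells. The helper name `installed_version` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the packages this notebook depends on most directly.
for name in ("openai", "langchain", "deeplake"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If a reported version differs from the one shown above, the imports in the later cells may behave differently, since those libraries have changed their module layout over time.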


***
**Ingestion document: the Llama 2 paper, https://arxiv.org/abs/2307.09288 (77 pages, PDF, 13.7 MB)**

**i. Local CPU: full ingestion with model downloads on an MBP 14" M1 Pro takes approx. 7 minutes until ingestion is loaded. Generating the embeddings and the transformation/ingestion for the vector DB take about 3 minutes each.**

**ii. Colab Pro with the High-RAM CPU option (on MBP 14" M1 Pro): full ingestion with model downloads takes approx. 3 minutes until ingestion is loaded. Generating the embeddings and the transformation/ingestion for the vector DB take about 1 min 19 s each.**
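
The figures above are wall-clock timings per stage. A minimal sketch of how such per-stage timings could be collected, assuming a small context manager. The names `stage_timer` and `timings` are hypothetical and not part of the original notebook, and the workload here is a stand-in.

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> elapsed seconds


@contextmanager
def stage_timer(name):
    """Record the wall-clock duration of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start


# Stand-in workload; in the notebook this could wrap e.g. generate_embeddings(tokens).
with stage_timer("demo"):
    sum(i * i for i in range(100_000))

for name, seconds in timings.items():
    print(f"{name}: {seconds:.2f}s")
```

Wrapping each pipeline call in its own `with stage_timer(...)` block would give one entry per stage, which makes the local and Colab runs easier to compare.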

In [None]:
import os
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import UnstructuredFileLoader, UnstructuredAPIFileLoader
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.prompts import PromptTemplate
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter, TokenTextSplitter
from langchain.vectorstores import DeepLake
from multiprocessing import Pool
from tqdm import tqdm
from transformers import BertTokenizer, BertModel

In [None]:
# Set up API keys & environment variables
os.environ['OPENAI_API_KEY'] = 'openai-api-key-here'
os.environ['ACTIVELOOP_TOKEN'] = 'activeloop-token-here'
os.environ["TOKENIZERS_PARALLELISM"] = "true"

# Configuration
my_activeloop_org_id = "acdcdigital"
my_activeloop_dataset_name = "solomon-chatset"
vector_store_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"
file_path = "/content/sample_data/llama2-opensource.pdf"
model_name = "distilbert-base-uncased"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}

# Embedding function using HuggingFace
embedding_function = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

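def load_documents(file_path):
    # NOTE: UnstructuredFileLoader's keyword for this option is `mode`. `model` is
    # passed through as an extra keyword, so the loader appears to stay in its
    # default single-document mode (the run below reports 1 loaded document).
    loader = UnstructuredFileLoader(
        file_path,
        model="elements",
        strategy="fast",
    )

    print(f"Loading documents from {file_path}")
    # loader.load() returns only after parsing is done; tqdm just iterates the finished list.
    documents = list(tqdm(loader.load(), desc="Loading documents"))
    return documents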

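def split_text_into_chunks(documents):
    text_splitter = RecursiveCharacterTextSplitter(
        # The lookbehind pattern is only applied as a regex if the splitter is built
        # with is_separator_regex=True; otherwise it is matched as a literal string.
        separators=["\n\n", "\n", r"(?<=\. )", " ", ""],
        chunk_size=2000,  # measured in characters, because length_function=len
        chunk_overlap=50,
        length_function=len,
    )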
    with Pool() as pool:
        chunks_list = pool.map(text_splitter.split_text, [doc.page_content for doc in documents])
    return [chunk for sublist in chunks_list for chunk in sublist]

def split_chunks_into_tokens(chunks):
    # Despite the name, this returns text pieces of at most 2000 tokens each,
    # not individual tokens.
    token_text_splitter = TokenTextSplitter(chunk_size=2000, chunk_overlap=50)
    tokens_list = [token_text_splitter.split_text(chunk) for chunk in chunks]
    return [token for sublist in tokens_list for token in sublist]

def generate_embeddings(tokens):
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")
    token_embeddings = []
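    for token in tqdm(tokens, desc="Generating embeddings"):
        # Inputs longer than 512 BERT tokens are truncated before encoding.
        inputs = tokenizer(token, return_tensors="pt", truncation=True, padding=True, max_length=512)
        outputs = model(**inputs)
        # Mean-pool the last hidden state over the sequence dimension:
        # one 768-dimensional vector per text piece.
        embeddings = outputs.last_hidden_state.mean(dim=1).detach().numpy().tolist()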
        token_embeddings.extend(embeddings)
    return token_embeddings

# Load and process documents
documents = load_documents(file_path)
chunks = split_text_into_chunks(documents)
tokens = split_chunks_into_tokens(chunks)
token_embeddings = generate_embeddings(tokens)

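# Load the vector store and add the documents
db = DeepLake(dataset_path=vector_store_path, embedding=embedding_function, read_only=False)
# NOTE: `embedding` is not a field of LangChain's Document, so the BERT vectors computed
# above do not appear to reach the store; DeepLake embeds each text again with
# `embedding_function` (see "Creating embedding data" in the output).
documents = [Document(page_content=token, embedding=embedding) for token, embedding in zip(tokens, token_embeddings)]
db.add_documents(documents)

# Initialize the RetrievalQA chain (the query itself is run in the next cell)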
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    chain_type="stuff",
    retriever=db.as_retriever()
)

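# Print the number of documents, chunks, tokens, and embeddings.
# `documents` was rebound above to the per-chunk Document objects, so the first
# count is the number of chunks, not the number of source files.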
print(f"Number of loaded documents: {len(documents)}")
print(f"Number of text chunks: {len(chunks)}")
print(f"Number of tokens: {len(tokens)}")
print(f"Number of generated embeddings: {len(token_embeddings)}")
print(documents[0])

Downloading (…)05411/.gitattributes:   0%|          | 0.00/445 [00:00<?, ?B/s]

Downloading (…)5d1cf5c05411/LICENSE:   0%|          | 0.00/11.4k [00:00<?, ?B/s]

Downloading (…)1cf5c05411/README.md:   0%|          | 0.00/8.58k [00:00<?, ?B/s]

Downloading (…)f5c05411/config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/268M [00:00<?, ?B/s]

Downloading (…)05411/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)1cf5c05411/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]



Loading documents from /content/sample_data/llama2-opensource.pdf


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
Loading documents: 100%|██████████| 1/1 [00:00<00:00, 15033.35it/s]


Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Generating embeddings: 100%|██████████| 156/156 [01:07<00:00,  2.30it/s]


Your Deep Lake dataset has been successfully created!


Creating embedding data: 100%|██████████| 30/30 [00:33<00:00,  1.11s/it]

Dataset(path='hub://acdcdigital/solomon-chatset', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype      shape      dtype  compression
  -------    -------    -------    -------  ------- 
   text       text      (156, 1)     str     None   
 metadata     json      (156, 1)     str     None   
 embedding  embedding  (156, 768)  float32   None   
    id        text      (156, 1)     str     None   
Number of loaded documents: 156
Number of text chunks: 156
Number of tokens: 156
Number of generated embeddings: 156
page_content='Llama 2: Open Foundation and Fine-Tuned Chat Models\n\n3 2 0 2\n\nl u J\n\nHugo Touvron∗ Louis Martin† Kevin Stone† Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou Hakan In
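
The splitters in the cell above are configured with `chunk_size=2000` and `chunk_overlap=50`. As a rough illustration of what those two parameters mean, here is a simplified fixed-width sliding window over characters. It is not the algorithm `RecursiveCharacterTextSplitter` uses (which also tries the separators in order), and `sliding_chunks` is a hypothetical helper, not part of the original notebook.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split `text` into windows of at most `chunk_size` characters, each
    starting `chunk_size - chunk_overlap` characters after the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

Each printed window repeats the last three characters of the previous one, which is the role the 50-character overlap plays in the notebook. In the run above both splitting stages report 156 pieces, which is consistent with character chunks of at most 2000 characters already fitting inside the 2000-token limit of the second splitter.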

 

In [None]:
db = DeepLake(dataset_path=vector_store_path, embedding=embedding_function, read_only=False)

retriever = db.as_retriever()
retriever.search_kwargs["distance_metric"] = "cos"
retriever.search_kwargs["k"] = 4

template = """You are Mark, a devoted and specialized personal assistant who is an expert in all areas of life. You now help me with questions regarding technical documents, breaking down complex arguments into manageable conversational pieces, and any other area where your vast knowledge may benefit my thirst for learning. If you don't know the answer, just say that you don't know; don't try to make up an answer.
{context}
Question: {question}
Helpful Answer:
    """

QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=retriever,  # reuse the retriever configured above (cosine distance, k=4)
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

ans = qa_chain({"query": "What type of document is this and what does it explain?"})
print(ans)

Deep Lake Dataset in hub://acdcdigital/solomon-chatset already exists, loading from the storage
{'query': 'What type of document is this and what does it explain?', 'result': 'This appears to be a technical document that explains the methodology and evaluation process for a study or experiment. It provides information on how human annotators were presented with prompts and model generations, and how they were asked to evaluate and compare the responses. The document also mentions the use of different models, such as Llama 2-Chat, ChatGPT, PaLM-chat, Falcon, MPT, and Vicuna, and their system prompts. Overall, it seems to be a detailed explanation of the evaluation methodology used in the study.'}
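
The retriever in the cell above is configured with `distance_metric = "cos"` and `k = 4`. A minimal sketch of what cosine-similarity top-k retrieval amounts to, in plain Python over toy 3-dimensional vectors. This illustrates the ranking idea only and is not Deep Lake's implementation; the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=4):
    """Return the indices of the k vectors most similar to `query`, best first."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy "embeddings"; the real ones in this notebook have 768 dimensions.
store = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0],
]
print(top_k([1.0, 0.0, 0.0], store, k=2))
```

In the chain above, the texts behind the top-ranked vectors are what fill the `{context}` slot of the prompt, so the choice of `k` bounds how much of the paper the model sees for each question.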
