
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Lab: Adding Our Own Data to a Multi-Stage Reasoning System

### Working with external knowledge bases 
In this notebook we're going to augment the knowledge base of our LLM with additional data. We will split the notebook into two halves:
- First, we will walk through how to load in a relatively small, local text file using a `DocumentLoader`, split it into chunks, and store it in a vector database using `ChromaDB`.
- Second, you will get a chance to show what you've learned by building a larger system with the complete works of Shakespeare. 
----
### ![Dolly](https://files.training.databricks.com/images/llm/dolly_small.png) Learning Objectives

By the end of this notebook, you will be able to:
1. Add external local data to your LLM's knowledge base via a vector database.
2. Construct a Question-Answer (QA) LLMChain to "talk to your data."
3. Load external data sources from remote locations and store them in a vector database.
4. Leverage different retrieval methods to search over your data.

## Classroom Setup

In [0]:
%run ../Includes/Classroom-Setup

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.


Resetting the learning environment:
| No action taken

Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/large-language-models/v01"

Validating the locally installed datasets:
| listing local files...(8 seconds)
| validation completed...(8 seconds total)


Importing lab testing framework.



Using the "default" schema.

Predefined paths variables:
| DA.paths.working_dir: /dbfs/mnt/dbacademy-users/labuser4359064@vocareum.com/large-language-models
| DA.paths.user_db:     dbfs:/mnt/dbacademy-users/labuser4359064@vocareum.com/large-language-models/database.db
| DA.paths.datasets:    /dbfs/mnt/dbacademy-datasets/large-language-models/v01

Setup completed (22 seconds)

The models developed or used in this course are for demonstration and learning purposes only.
Models may occasionally output offensive, inaccurate, biased information, or harmful instructions.


Install libraries.

In [0]:
%pip install chromadb==0.3.21 tiktoken==0.3.3

Collecting chromadb==0.3.21
  Downloading chromadb-0.3.21-py3-none-any.whl (46 kB)
Collecting uvicorn[standard]>=0.18.3
  Downloading uvicorn-0.23.1-py3-none-any.whl (59 kB)
Collecting posthog>=2.4.0
  Downloading posthog-3.0.1-py2.py3-none-any.whl (37 kB)
Collecting fastapi>=0.85.1
  Downloading fastapi-0.100.0-py3-none-any.whl (65 kB)
Collecting hnswlib>=0.7
  Downloading hnswlib-0.7.0.tar.gz (33 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (py

Fill in your credentials.

In [0]:
# TODO
# For many of the services that we'll be using in the notebook, we'll need a HuggingFace API key, so this cell asks for it:
# HuggingFace Hub: https://huggingface.co/inference-api

import os
os.environ['HUGGINGFACEHUB_API_TOKEN'] = '<MY_HUGGINGFACEHUB_API_TOKEN>'

## Building a Personalized Document Oracle

In this notebook, we're going to build a special type of LLMChain that will enable us to ask questions of our data. We will be able to "speak to our data".

### Step 1 - Loading Documents into our Vector Store
For this system we'll leverage the [ChromaDB vector database](https://www.trychroma.com/) and load in some text we have on file. The file contains both a long-form review of a hypothetical laptop and brief customer reviews. We'll use LangChain's `TextLoader` to load this data.
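To make the loader's output easier to read, here is a minimal sketch of what loading a text file amounts to: the whole file becomes the `page_content` of a single document, with the file path kept as metadata. `SimpleDocument` and `load_text_file` are hypothetical stand-ins written for this illustration, not LangChain's own classes, and the sample file is invented.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path, encoding="utf8"):
    """Read one text file and wrap it in a single-element list of documents."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Usage with a throwaway file; the lab reads fake_laptop_reviews.txt instead.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample_review.txt"
    sample_path.write_text("Great laptop.\n\nBattery could be better.", encoding="utf8")
    sample_docs = load_text_file(sample_path)

print(len(sample_docs), sample_docs[0].metadata["source"])
```

Because the result is a list holding one large document, the text still has to be split into smaller pieces before it is embedded.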

In [0]:
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader
# We have some fake laptop reviews that we can load in
laptop_reviews = TextLoader(f"{DA.paths.datasets}/reviews/fake_laptop_reviews.txt", encoding="utf8")
document = laptop_reviews.load()

print(document)



[Document(page_content='Raytech Supernova Laptop Review: A Star in the Making\nIntroduction\nThe laptop market has become increasingly competitive in recent years, with countless manufacturers vying for consumer attention. Raytech, a relatively new player in the game, has recently released the Supernova laptop, a device that aims to establish itself among the giants of the industry. In this comprehensive review, we will delve into every aspect of the Raytech Supernova laptop, covering its design, performance, features, and value for money. Let\'s find out if this newcomer has what it takes to make an impact in the crowded market.\nDesign and Build Quality\nThe first thing you\'ll notice about the Raytech Supernova is its sleek, modern design. The laptop is encased in a premium, brushed aluminum chassis with a matte finish, lending it an air of sophistication. It\'s a lightweight device, weighing in at just 2.8 pounds, making it easy to carry around for those always on the go. The slim 

### Step 2 - Chunking and Embeddings

Now that we have the data in document format, we will split it into chunks using a `CharacterTextSplitter` and embed those chunks with a Hugging Face sentence-embedding model so they can be stored in our vector store.

In [0]:
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# First we split the data into manageable chunks to store as vectors. There isn't an exact way to do this: more chunks mean more detailed context, but they also increase the size of our vectorstore.
text_splitter = CharacterTextSplitter(chunk_size=250, chunk_overlap=10)
texts = text_splitter.split_documents(document)
# Now we'll create embeddings for our document so we can store it in a vector store and feed the data into an LLM. We'll use the sentence-transformers model for our embeddings. https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models/
model_name = "sentence-transformers/all-MiniLM-L12-v2"
embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    cache_folder=DA.paths.datasets)  # Use a pre-cached model
# Finally we make our Index using chromadb and the embeddings LLM
chromadb_index = Chroma.from_documents(texts, embeddings, persist_directory=DA.paths.working_dir)

Created a chunk of size 6702, which is longer than the specified 250
Created a chunk of size 285, which is longer than the specified 250
Created a chunk of size 278, which is longer than the specified 250
Created a chunk of size 260, which is longer than the specified 250
Created a chunk of size 254, which is longer than the specified 250
Created a chunk of size 258, which is longer than the specified 250
Created a chunk of size 286, which is longer than the specified 250
Created a chunk of size 286, which is longer than the specified 250
Created a chunk of size 275, which is longer than the specified 250
Created a chunk of size 295, which is longer than the specified 250
Using embedded DuckDB with persistence: data will be stored in: /dbfs/mnt/dbacademy-users/labuser4359064@vocareum.com/large-language-models
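The "Created a chunk of size ..." warnings above are worth a closer look. They are consistent with a splitter that first cuts the text on a separator (a blank line, as far as we know the default for `CharacterTextSplitter`) and then merges the pieces up to `chunk_size`, so a single piece with no separator inside it cannot be made any smaller. The sketch below is a simplified stand-in for that idea, not LangChain's implementation: `split_on_separator` is a hypothetical helper, it ignores `chunk_overlap`, and the sample text is invented.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Split text on a separator, then greedily merge pieces up to chunk_size.

    Simplified: no chunk overlap, and a piece longer than chunk_size is kept whole.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # may exceed chunk_size if the piece itself is too long
    if current:
        chunks.append(current)
    return chunks


# One long paragraph without blank lines sits between two short ones.
sample = "Short intro.\n\n" + "A long paragraph without blank lines. " * 10 + "\n\nShort outro."
chunks = split_on_separator(sample, chunk_size=50)
for chunk in chunks:
    flag = "over limit" if len(chunk) > 50 else "ok"
    print(len(chunk), flag)
```

In this toy run the middle paragraph stays in one oversized chunk because it contains no blank line to split on. If the same thing is happening with the long-form review, a different separator or a recursive splitter may be worth trying.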


### Step 3 - Creating our Document QA LLM Chain
With our data now in vector form we need an LLM and a chain to take our queries and create tasks for our LLM to perform.

In [0]:
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline

# We want to make this a retriever, so we need to convert our index.  This will create a wrapper around the functionality of our vector database so we can search for similar documents/chunks in the vectorstore and retrieve the results:
retriever = chromadb_index.as_retriever()

# This chain will be used to do QA on the document. We will need:
# 1 - An LLM to do the language interpretation
# 2 - A vector database that can perform document retrieval
# 3 - A specification of how to deal with this data (more on this soon)

hf_llm = HuggingFacePipeline.from_model_id(
    model_id="google/flan-t5-large",
    task="text2text-generation",
    model_kwargs={
        "temperature": 0,
        "max_length": 4096,
        "cache_dir": DA.paths.datasets,
    },
)

chain_type = "refine"  # Options: stuff, map_reduce, refine, map_rerank
laptop_qa = RetrievalQA.from_chain_type(
    llm=hf_llm, chain_type=chain_type, retriever=retriever
)

Downloading (…)okenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]



Downloading spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/3.13G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]
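The retriever created above wraps a similarity search over the stored vectors. As a rough illustration of that idea, the sketch below ranks chunks by brute-force cosine similarity to a query vector and keeps the top `k`. This is an assumption-laden toy: the three-dimensional vectors are invented, `retrieve` is a hypothetical helper, and a real vector database typically uses an approximate nearest-neighbor index and its own distance metric rather than this exhaustive loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vector, indexed_chunks, k=2):
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy 3-dimensional "embeddings", invented for illustration only.
toy_index = [
    ("The display is a 15.6-inch panel.", [0.9, 0.1, 0.0]),
    ("Battery life lasts all day.", [0.1, 0.9, 0.1]),
    ("Customers liked the keyboard.", [0.0, 0.2, 0.9]),
]
toy_query = [0.8, 0.2, 0.1]  # pretend embedding of "How big is the screen?"
top_chunks = retrieve(toy_query, toy_index, k=2)
print(top_chunks)
```

Only the retrieved chunks, not the whole document, are handed to the LLM, which is why the quality of the embeddings and the chunking both affect the final answer.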

### Step 4 - Talking to Our Data
Now we are ready to send prompts to our chain. For each question, it retrieves the relevant chunks of our data, passes them to the LLM along with the prompt, and returns a response.

In [0]:
# Let's ask the chain about the product we have.
laptop_name = laptop_qa.run(
    "What is the full name of the laptop?"
)
display(laptop_name)

Token indices sequence length is longer than the specified maximum sequence length for this model (1477 > 512). Running this sequence through the model will result in indexing errors


'Raytech Supernova'

In [0]:
# Now we'll ask the chain about the product.
laptop_features = laptop_qa.run(
    "What are some of the laptop's features?"
)
display(laptop_features)

'15.6-inch 4K UHD (3840 x 2160) IPS display'

In [0]:
# Finally let's ask the chain about the reviews.
# (Stored under a new name so the `laptop_reviews` loader defined earlier is not overwritten.)
laptop_sentiment = laptop_qa.run(
    "What is the general sentiment of the reviews?"
)
display(laptop_sentiment)

'Positive'

## Exercise: Working with larger documents
This document was relatively small, so let's see if we can work with something bigger. To show how well the vector database scales, let's load in a larger document. For this we'll get data from [Project Gutenberg](https://www.gutenberg.org/), which hosts thousands of free-to-access texts. We'll use the complete works of William Shakespeare.

Instead of loading a local text file, we'll download the text using LangChain's `GutenbergLoader`.

In [0]:
from langchain.document_loaders import GutenbergLoader

loader = GutenbergLoader(
    "https://www.gutenberg.org/cache/epub/100/pg100.txt"
)  # Complete works of Shakespeare in a txt file

all_shakespeare_text = loader.load()

### Question 1

Now it's your turn! Based on what we did previously, fill in the missing parts below to build your own QA LLMChain.

In [0]:
# STEP #1: split text -> embed -> send to Chroma Vector DB
text_splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=20)
texts = text_splitter.split_documents(all_shakespeare_text)

model_name = "sentence-transformers/all-MiniLM-L12-v2"
embeddings = HuggingFaceEmbeddings(
  model_name=model_name,
  cache_folder=DA.paths.datasets
)
docsearch = Chroma.from_documents(texts, embeddings, persist_directory=DA.paths.working_dir)

Using embedded DuckDB with persistence: data will be stored in: /dbfs/mnt/dbacademy-users/labuser4359064@vocareum.com/large-language-models


In [0]:
# Test your answer. DO NOT MODIFY THIS CELL.

dbTestQuestion3_1(embeddings, docsearch)

PASSED: All tests passed for lesson3, question1
RESULTS RECORDED: Click `Submit` when all questions are completed to log the results.


### Question 2

Let's see if we can do what we did with the laptop reviews. 

Think about what is likely to happen now. Will this command succeed? 

(***Hint: think about the maximum sequence length of a model***)

In [0]:
# TODO

retriever = docsearch.as_retriever()

hf_llm = HuggingFacePipeline.from_model_id(
    model_id="google/flan-t5-large",
    task="text2text-generation",
    model_kwargs={
        "temperature": 0,
        "max_length": 9800,
        "cache_dir": DA.paths.datasets,
    },
)

chain_type = "refine"  # Options: stuff, map_reduce, refine, map_rerank
qa = RetrievalQA.from_chain_type(
    llm=hf_llm, chain_type=chain_type, retriever=retriever
)

# The simplest method, "stuff", puts all of the retrieved data into a single prompt and asks a question of it,
# so the prompt can grow past the model's maximum sequence length. This cell uses "refine" instead.
query = "Who is the main character in the play Hamlet?"

query_results_hamlet = qa.run(
     query
)

query_results_hamlet

'Hamlet (play) is a 16th-century English dramatist'

In [0]:
# Test your answer. DO NOT MODIFY THIS CELL.

dbTestQuestion3_2(qa, query_results_hamlet)

PASSED: All tests passed for lesson3, question2
RESULTS RECORDED: Click `Submit` when all questions are completed to log the results.


### Question 3

Now that we're working with larger documents, we should be mindful of the input sequence limitations that our LLM has. 

Chain types for working with retrieved documents:

- [`stuff`](https://docs.langchain.com/docs/components/chains/index_related_chains#stuffing) - Stuffing is the simplest method, whereby you simply stuff all the related data into the prompt as context to pass to the language model.
- [`map_reduce`](https://docs.langchain.com/docs/components/chains/index_related_chains#map-reduce) - This method involves running an initial prompt on each chunk of data (for summarization tasks, this could be a summary of that chunk; for question-answering tasks, it could be an answer based solely on that chunk). A second prompt is then run to combine all of these initial outputs.
- [`refine`](https://docs.langchain.com/docs/components/chains/index_related_chains#refine) - This method involves running an initial prompt on the first chunk of data, generating some output. For the remaining documents, that output is passed in, along with the next document, asking the LLM to refine the output based on the new document.
- [`map_rerank`](https://docs.langchain.com/docs/components/chains/index_related_chains#map-rerank) - This method involves running an initial prompt on each chunk of data, that not only tries to complete a task but also gives a score for how certain it is in its answer. The responses are then ranked according to this score, and the highest score is returned.
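The four strategies above differ mainly in how many model calls they make and how much text goes into each prompt. The sketch below mimics that control flow with plain functions so the difference can be measured. Everything here is hypothetical scaffolding: `RecordingLLM` and `fake_scoring_llm` are fake models that only record or score prompts, the prompt wording is invented, and none of this is LangChain's actual implementation.

```python
def stuff(llm, question, chunks):
    """One call: all chunks are placed in a single prompt."""
    context = "\n".join(chunks)
    return llm(f"Context:\n{context}\nQuestion: {question}")


def map_reduce(llm, question, chunks):
    """One call per chunk, then one more call to combine the partial answers."""
    partial = [llm(f"Context:\n{c}\nQuestion: {question}") for c in chunks]
    combined = "\n".join(partial)
    return llm(f"Partial answers:\n{combined}\nQuestion: {question}")


def refine(llm, question, chunks):
    """The first chunk gives a draft; each later chunk is used to revise it."""
    answer = llm(f"Context:\n{chunks[0]}\nQuestion: {question}")
    for c in chunks[1:]:
        answer = llm(
            f"Existing answer: {answer}\nNew context:\n{c}\nQuestion: {question}"
        )
    return answer


def map_rerank(llm_with_score, question, chunks):
    """One call per chunk returning (answer, score); keep the best-scored answer."""
    scored = [llm_with_score(f"Context:\n{c}\nQuestion: {question}") for c in chunks]
    best_answer, _ = max(scored, key=lambda pair: pair[1])
    return best_answer


class RecordingLLM:
    """Fake model that records prompt sizes so the strategies can be compared."""

    def __init__(self):
        self.prompt_lengths = []

    def __call__(self, prompt):
        self.prompt_lengths.append(len(prompt))
        return f"answer#{len(self.prompt_lengths)}"


def fake_scoring_llm(prompt):
    """Fake scorer: pretends that confidence grows with prompt length."""
    return (f"answer from {len(prompt)}-char prompt", len(prompt))


chunks = ["chunk one " * 20, "chunk two " * 20, "chunk three " * 20]
question = "Who is the main character?"

for strategy in (stuff, map_reduce, refine):
    llm = RecordingLLM()
    strategy(llm, question, chunks)
    print(strategy.__name__, "calls:", len(llm.prompt_lengths),
          "largest prompt:", max(llm.prompt_lengths))

print("map_rerank ->", map_rerank(fake_scoring_llm, question, chunks))
```

In this toy setup `stuff` makes a single call but its prompt grows with every retrieved chunk, while the other strategies trade extra calls for smaller prompts. That trade-off is what matters when the model has a limited input sequence length.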

In [0]:
chain_type = "refine"  # Options: stuff, map_reduce, refine, map_rerank
# qa = RetrievalQA.from_chain_type(
#     llm=hf_llm, chain_type=chain_type, retriever=retriever
# )

# TODO
qa = RetrievalQA.from_chain_type(
  llm=hf_llm, chain_type=chain_type, retriever=docsearch.as_retriever())
query = "Who is the main character in the Merchant of Venice?"
query_results_venice = qa.run(query)
 
query_results_venice

'A MERCHANT, friend to Antipholus of Syracuse.'

In [0]:
# Test your answer. DO NOT MODIFY THIS CELL.

dbTestQuestion3_3(qa, query_results_venice)

PASSED: All tests passed for lesson3, question3
RESULTS RECORDED: Click `Submit` when all questions are completed to log the results.


In [0]:
# TODO
# Let's ask another question. Change chain_type below to try a different type.
chain_type = "refine"  # Options: stuff, map_reduce, refine, map_rerank
qa = RetrievalQA.from_chain_type(
    llm=hf_llm, chain_type=chain_type, retriever=retriever
)
query = "What happens to romeo and juliet?"
query_results_romeo = qa.run(query)
 
query_results_romeo

'Romeo is belov’d, and loves again, Alike bewitched by the charm of looks; But to his foe suppos’d he must complain, And she steal love’s sweet bait from fearful hooks: Being held a foe, he may not have access'

In [0]:
# Test your answer. DO NOT MODIFY THIS CELL.

dbTestQuestion3_4(qa, query_results_romeo)

PASSED: All tests passed for lesson3, question4
RESULTS RECORDED: Click `Submit` when all questions are completed to log the results.


## Submit your Results!

To get credit for this lab, click the submit button to report the results. If you run into any issues, click `Run` -> `Clear state and run all`, and make sure all tests have passed before re-submitting. If you accidentally deleted any tests, take a look at the notebook's version history to recover them or reload the notebooks.

&copy; 2023 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>