In [4]:
import os
import openai
import sys
from dotenv import load_dotenv, find_dotenv, set_key

In [5]:
# Currently running on Windows
# Get the current Conda environment path
conda_env_path = os.environ.get('CONDA_PREFIX')

# Absolute path of the ".env" file
dotenv_path = os.path.join(conda_env_path, '.env')

# Load the ".env" file
_ = load_dotenv(dotenv_path, verbose=True)
openai.api_key = os.environ['OPENAI_API_KEY']

# Intro to Keeping Knowledge Organized with Indexes

- Exploring The Role of LangChain's Indexes and Retrievers: To kick off the module, we introduce the Deep Lake database and its seamless integration with the LangChain library. This lesson highlights the benefits of utilizing Deep Lake, including the ability to retrieve pertinent documents for contextual use. Additionally, we delve into the limitations of this approach and present solutions to overcome them.
- Streamlined Data Ingestion: Text, PyPDF, Selenium URL Loaders, and Google Drive Sync: The LangChain library offers a variety of helper classes designed to facilitate data loading and extraction from diverse sources. Regardless of whether the information originates from a PDF file or website content, these classes streamline the process of handling different data formats.
- What are Text Splitters and Why They are Useful: The length of the contents may vary depending on their source. For instance, a PDF file containing a book may exceed the input window size of the model, making it incompatible with direct processing. However, splitting the large text into smaller segments will allow us to use the most relevant chunk as the context instead of expecting the model to comprehend the whole book and answer a question. This lesson will thoroughly explore different approaches that enable us to accomplish this objective.
- Exploring the World of Embeddings: Embeddings are high-dimensional vectors that capture semantic information. Large language models can map text into an embedding space, and some do so across multiple languages. Relevant information can then be found by measuring the distance between vectors: points that lie closer together are closer in meaning. The LangChain integration provides the functions needed both to compute embeddings and to calculate similarities.
- Build a Customer Support Question Answering Chatbot: This practical example demonstrates the utilization of a website's content as supplementary context for a chatbot to respond to user queries effectively. The code implementation involves employing the mentioned data loaders, storing the corresponding embeddings in the Deep Lake dataset, and ultimately retrieving the most pertinent documents based on the user's question.
- Conversation Intelligence: Gong.io Open-Source Alternative AI Sales Assistant: In this lesson, we explore how LangChain, Deep Lake, and GPT-4 can be used to build a sales assistant that gives advice to salespeople while taking internal guidelines into consideration.
- FableForge: Creating Picture Books with OpenAI, Replicate, and Deep Lake: In this final lesson, we look at a creative use case, children's picture book creation, in a project called "FableForge" that uses OpenAI's GPT-3.5 LLM to write the story and Stable Diffusion to generate its images.

# Exploring the Role of LangChain's Indexes and Retrievers

In [4]:
from langchain.document_loaders import TextLoader

# text to write to a local file
# taken from https://www.theverge.com/2023/3/14/23639313/google-ai-language-model-palm-api-challenge-openai
text = """Google opens up its AI language model PaLM to challenge OpenAI and GPT-3
Google is offering developers access to one of its most advanced AI language models: PaLM.
The search giant is launching an API for PaLM alongside a number of AI enterprise tools
it says will help businesses “generate text, images, code, videos, audio, and more from
simple natural language prompts.”

PaLM is a large language model, or LLM, similar to the GPT series created by OpenAI or
Meta’s LLaMA family of models. Google first announced PaLM in April 2022. Like other LLMs,
PaLM is a flexible system that can potentially carry out all sorts of text generation and
editing tasks. You could train PaLM to be a conversational chatbot like ChatGPT, for
example, or you could use it for tasks like summarizing text or even writing code.
(It’s similar to features Google also announced today for its Workspace apps like Google
Docs and Gmail.)
"""

# write text to local file
with open("source/my_file.txt", "w") as file:
    file.write(text)

# use TextLoader to load text from local file
loader = TextLoader("source/my_file.txt")
docs_from_file = loader.load()

print(len(docs_from_file))

1


In [5]:
from langchain.text_splitter import CharacterTextSplitter

# create a text splitter
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=20)

# split documents into chunks
docs = text_splitter.split_documents(docs_from_file)

print(len(docs))

Created a chunk of size 373, which is longer than the specified 200


2


In [6]:
from langchain.embeddings import OpenAIEmbeddings

# Before executing the following code, make sure to have
# your OpenAI key saved in the “OPENAI_API_KEY” environment variable.
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

In [4]:
#!pip install --upgrade -r requirements.txt

In [8]:
from langchain.vectorstores import DeepLake

# Before executing the following code, make sure to have your
# Activeloop key saved in the “ACTIVELOOP_TOKEN” environment variable.

# create Deep Lake dataset
# TODO: use your organization id here. (by default, org id is your username)
my_activeloop_org_id = os.environ["ACTIVELOOP-ORG-ID"]
my_activeloop_dataset_name = "langchain_course_indexers_retrievers"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)

# add documents to our Deep Lake dataset
db.add_documents(docs)

Your Deep Lake dataset has been successfully created!


 

Dataset(path='hub://bettermaxfeng/langchain_course_indexers_retrievers', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype      shape     dtype  compression
  -------    -------    -------   -------  ------- 
 embedding  embedding  (2, 1536)  float32   None   
    id        text      (2, 1)      str     None   
 metadata     json      (2, 1)      str     None   
   text       text      (2, 1)      str     None   


['e9ff15b0-407b-11ee-b60b-d89c6787905c',
 'e9ff15b1-407b-11ee-98a7-d89c6787905c']

In [9]:
# create retriever from db
retriever = db.as_retriever()

In [10]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# create a retrieval chain
qa_chain = RetrievalQA.from_chain_type(
	llm=OpenAI(model="text-davinci-003"),
	chain_type="stuff",
	retriever=retriever
)

In [11]:
query = "How Google plans to challenge OpenAI?"
response = qa_chain.run(query)
print(response)

 Google plans to challenge OpenAI by offering developers access to one of its most advanced AI language models, called PaLM.


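Retrieved chunks often contain text that is irrelevant to the query, which wastes context and can distract the model. The `DocumentCompressor` abstraction was introduced to address this issue: it runs `compress_documents` on the retrieved documents.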

The `ContextualCompressionRetriever` is a wrapper around another retriever in LangChain. It takes a base retriever and a `DocumentCompressor` and automatically compresses the retrieved documents from the base retriever. This means that only the most relevant parts of the retrieved documents are returned, given a specific query.

A popular compressor choice is the `LLMChainExtractor`, which uses an `LLMChain` to extract only the statements relevant to the query. Wrapping the base retriever in a `ContextualCompressionRetriever` with an `LLMChainExtractor` means each initially returned document is passed through the extractor, and only the content relevant to the query is kept.

Here's an example of how to use `ContextualCompressionRetriever` with `LLMChainExtractor`:

In [12]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# create GPT3 wrapper
llm = OpenAI(model="text-davinci-003", temperature=0)

# create compressor for the retriever
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
	base_compressor=compressor,
	base_retriever=retriever
)

In [13]:
# retrieving compressed documents
retrieved_docs = compression_retriever.get_relevant_documents(
	"How Google plans to challenge OpenAI?"
)
print(retrieved_docs[0].page_content)



Google is offering developers access to one of its most advanced AI language models: PaLM. The search giant is launching an API for PaLM alongside a number of AI enterprise tools it says will help businesses “generate text, images, code, videos, audio, and more from simple natural language prompts.”


# Streamlined Data Ingestion: Text, PyPDF, Selenium URL Loaders, and Google Drive Sync

## TextLoader

In [None]:
from langchain.document_loaders import TextLoader
loader = TextLoader('file_path.txt')
documents = loader.load()

You can use the `encoding` argument to change the encoding used to read the file (for example, `encoding="ISO-8859-1"`).
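
To see why the encoding matters, here is a minimal standard-library sketch. It does not use LangChain, and the file name and sample text are made up for illustration. A file saved as ISO-8859-1 fails to decode as UTF-8 but reads correctly once the matching encoding is given, which is the kind of situation the `encoding` argument is meant for.

```python
import os
import tempfile

# Hypothetical sample text containing non-ASCII characters.
sample = "Café déjà vu"

# Write the text to a temporary file using ISO-8859-1 (Latin-1).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "latin1_example.txt")
with open(path, "w", encoding="ISO-8859-1") as f:
    f.write(sample)

# Reading with the wrong encoding raises UnicodeDecodeError.
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Reading with the matching encoding recovers the original text.
with open(path, encoding="ISO-8859-1") as f:
    recovered = f.read()

print(utf8_failed, recovered == sample)
```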

## PyPDFLoader (PDF)

In [14]:
# !pip install -q pypdf

In [15]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("source/GeoMan论文.pdf") 
pages = loader.load_and_split()

print(pages[0])

page_content='GeoMAN: Multi-level Attention Networks for Geo-sensory Time Series Prediction\nYuxuan Liang1;2, Songyu Ke3;2, Junbo Zhang2;4, Xiuwen Yi4;2, Yu Zheng2;1;3;4\n1School of Computer Science and Technology, Xidian University, Xi’an, China\n2Urban Computing Business Unit, JD Finance, Beijing, China\n3Zhiyuan College, Shanghai Jiao Tong University, Shanghai, China\n4School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China\nfyuxliang, songyu-ke, msjunbozhang, xiuwyi, msyuzhengg@outlook.com\nAbstract\nNumerous sensors have been deployed in different\ngeospatial locations to continuously and coopera-\ntively monitor the surrounding environment, such\nas the air quality. These sensors generate multiple\ngeo-sensory time series, with spatial correlations\nbetween their readings. Forecasting geo-sensory\ntime series is of great importance yet very chal-\nlenging as it is affected by many complex factors,\ni.e., dynamic spatio-temporal correlations and

## SeleniumURLLoader (URL)

In [17]:
#!pip install -q unstructured selenium

In [18]:
from langchain.document_loaders import SeleniumURLLoader

urls = [
    "https://www.youtube.com/watch?v=TFa539R09EQ&t=139s",
    "https://www.youtube.com/watch?v=6Zv6A_9urh4&t=112s"
]

loader = SeleniumURLLoader(urls=urls)
data = loader.load()

print(data[0])

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


page_content="OPENASSISTANT TAKES ON CHATGPT!\n\nSearch\n\nInfo\n\nShopping\n\nWatch Later\n\nShare\n\nCopy link\n\nTap to unmute\n\nIf playback doesn't begin shortly, try restarting your device.\n\nUp next\n\nLiveUpcoming\n\nPlay now\n\nYou're signed out\n\nVideos that you watch may be added to the TV's watch history and influence TV recommendations. To avoid this, cancel and sign in to YouTube on your computer.\n\nMachine Learning Street Talk\n\nSubscribe\n\nSubscribed\n\nSwitch camera\n\nShare\n\nAn error occurred while retrieving sharing information. Please try again later.\n\n2:19\n\n2:19 / 59:51\n\nWatch full video\n\n•\n\nScroll for details\n\nNaN / NaN\n\nSearch" metadata={'source': 'https://www.youtube.com/watch?v=TFa539R09EQ&t=139s'}


In [None]:
loader = SeleniumURLLoader(urls=urls, browser="firefox")

## Google Drive loader

In [19]:
from langchain.document_loaders import GoogleDriveLoader

In [None]:
loader = GoogleDriveLoader(
    folder_id="your_folder_id",
    recursive=False  # Optional: Fetch files from subfolders recursively. Defaults to False.
)

The folder or document ID can be read from the URL:

- Folder: `https://drive.google.com/drive/u/0/folders/{folder_id}`
- Document: `https://docs.google.com/document/d/{document_id}/edit`
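
As a small convenience, the ID could be pulled out of a pasted URL with a regular expression. The helper below is a hypothetical sketch, not part of LangChain. It assumes IDs consist of letters, digits, underscores, and hyphens, and the example IDs are invented.

```python
import re
from typing import Optional

# Patterns follow the two URL shapes listed above.
# Assumption: IDs contain only letters, digits, "_" and "-".
FOLDER_RE = re.compile(r"/folders/([A-Za-z0-9_-]+)")
DOCUMENT_RE = re.compile(r"/document/d/([A-Za-z0-9_-]+)")


def extract_drive_id(url: str) -> Optional[str]:
    """Return the folder or document ID embedded in a Google Drive URL, if any."""
    for pattern in (FOLDER_RE, DOCUMENT_RE):
        match = pattern.search(url)
        if match:
            return match.group(1)
    return None


# Made-up IDs, for illustration only.
folder_url = "https://drive.google.com/drive/u/0/folders/abc123_XYZ"
doc_url = "https://docs.google.com/document/d/doc-456/edit"
print(extract_drive_id(folder_url))
print(extract_drive_id(doc_url))
```

The returned string could then be passed as `folder_id` when constructing the loader.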

In [None]:
docs = loader.load()

# What are Text Splitters and Why They are Useful

Pros of grounding the LLM's answers in a retrieved source document:

- Reduced hallucination: By providing a source document, the LLM is more likely to generate content based on the given information, reducing the chances of creating false or irrelevant information.
- Increased accuracy: With a reliable source document, the LLM can generate more accurate answers, especially in use cases where accuracy is crucial.
- Verifiable information: Users can cross-check the generated content with the source document to ensure the information is accurate and reliable.

Cons:

- Limited scope: Relying on a single document may limit the scope of the generated content, as the LLM will only have access to the information provided in the document.
- Dependence on document quality: The accuracy of the generated content heavily depends on the quality and reliability of the source document. The LLM will likely generate incorrect or misleading content if the document contains inaccurate or biased information.
- Inability to eliminate hallucination completely: Although providing a document as a base reduces the chances of hallucination, it does not guarantee that the LLM will never generate false or irrelevant information.

## Customizing Text Splitter

At a high level, text splitters follow these steps:

- Divide the text into small, semantically meaningful chunks (often sentences).
- Combine these small chunks into a larger one until a specific size is reached (determined by a particular function).
- Once the desired size is attained, separate that chunk as an individual piece of text, then start forming a new chunk with some overlap to maintain context between segments.

Consequently, there are two primary dimensions to consider when customizing your text splitter:

- The method used to split the text
- The approach for measuring chunk size
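
The steps above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation. The function name `merge_pieces` and its parameters are hypothetical, and the text is split on spaces rather than sentences to keep the example short.

```python
from typing import Callable, List


def merge_pieces(
    pieces: List[str],
    chunk_size: int,
    chunk_overlap: int,
    length_function: Callable[[str], int] = len,
    joiner: str = " ",
) -> List[str]:
    """Greedily merge small pieces into chunks of at most chunk_size,
    carrying trailing pieces over as overlap into the next chunk."""
    chunks: List[str] = []
    current: List[str] = []
    for piece in pieces:
        candidate = joiner.join(current + [piece])
        if current and length_function(candidate) > chunk_size:
            # The chunk is full: emit it as an individual piece of text.
            chunks.append(joiner.join(current))
            # Keep as many trailing pieces as fit within chunk_overlap.
            overlap: List[str] = []
            for prev in reversed(current):
                if length_function(joiner.join([prev] + overlap)) > chunk_overlap:
                    break
                overlap.insert(0, prev)
            current = overlap
        current.append(piece)
    if current:
        chunks.append(joiner.join(current))
    return chunks


text = "one two three four five six seven eight nine ten"
chunks = merge_pieces(text.split(" "), chunk_size=15, chunk_overlap=5)
for chunk in chunks:
    print(repr(chunk))
```

The two customization dimensions map onto the two inputs: how `pieces` is produced (the splitting method) and which `length_function` is passed (the size measure).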

### Character Text Splitter

In [21]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("source/The One Page Linux Manual.pdf")
pages = loader.load_and_split()

In [24]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
texts = text_splitter.split_documents(pages)

print(f'length: {len(texts)}, first text: {texts[0]}')

# print (f"You have {len(texts)} documents")
print ("Preview:")
print (texts[0].page_content)

length: 2, first text: page_content='THE ONE     PAGE LINUX MANUALA summary of useful Linux commands\nVersion 3.0 May 1999 squadron@powerup.com.au\nStarting & Stopping\nshutdown -h now Shutdown the system now and do not\nreboot\nhalt Stop all processes - same as above\nshutdown -r 5 Shutdown the system in 5 minutes and\nreboot\nshutdown -r now Shutdown the system now and reboot\nreboot Stop all processes and then reboot - same\nas above\nstartx Start the X system\nAccessing & mounting file systems\nmount -t iso9660 /dev/cdrom\n/mnt/cdromMount the device cdrom\nand call it cdrom under the\n/mnt directory\nmount -t msdos /dev/hdd\n/mnt/ddriveMount hard disk “d” as a\nmsdos file system and call\nit ddrive under the /mnt\ndirectory\nmount -t vfat /dev/hda1\n/mnt/cdriveMount hard disk “a” as a\nVFAT file system and call it\ncdrive under the /mnt\ndirectory\numount /mnt/cdrom Unmount the cdrom\nFinding files and text within files\nfind / -name  fname Starting with the root directory, look\nf

Finding the best chunk size for your project means going through a few steps. 
- First, clean up your data by getting rid of anything that's not needed, like HTML tags from websites.
- Then, pick a few different chunk sizes to test. The best size will depend on what kind of data you're working with and the model you're using.
- Finally, test out how well each size works by running some queries and comparing the results.
- You might need to try a few different sizes before finding the best one.
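
The cleanup and bookkeeping parts of this procedure can be sketched as follows. The helper names, the sample HTML, and the candidate sizes are all made up for illustration, and the regex-based tag stripping is deliberately crude. Evaluating answer quality per chunk size depends on your retriever and queries, so it is not shown.

```python
import re
from typing import Dict, List


def clean_html(raw: str) -> str:
    """Strip HTML tags and collapse whitespace (a crude, regex-based cleanup)."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip()


def fixed_size_chunks(text: str, chunk_size: int) -> List[str]:
    """Cut the text into consecutive chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Hypothetical scraped snippet.
raw = "<p>Deep Lake stores embeddings.</p> <p>LangChain retrieves the most relevant chunks.</p>"
text = clean_html(raw)

# Record how many chunks each candidate size produces.
report: Dict[int, int] = {}
for size in (20, 40, 80):
    report[size] = len(fixed_size_chunks(text, size))
    print(f"chunk_size={size}: {report[size]} chunks")
```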

### Recursive Character Text Splitter

- `chunk_size`: The maximum size of each chunk, as measured by `length_function` (LangChain's default is 4000).
- `chunk_overlap`: The maximum overlap between consecutive chunks, used to maintain continuity between them (LangChain's default is 200).
- `length_function`: The function used to calculate the length of a chunk. It defaults to `len`, which counts characters, but you can pass a token counter or any other function that measures length according to your requirements.

Using a token counter instead of the default `len` function can benefit specific scenarios, such as when working with language models with token limits. For example, OpenAI's GPT-3 has a token limit of 4096 tokens per request, so you might want to count tokens instead of characters to better manage and optimize your requests.
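
The difference between the two measures can be illustrated with a rough stand-in for a tokenizer. The `approx_token_count` function below is hypothetical and only counts words and punctuation marks. With a real tokenizer you would pass a function that returns the number of encoded tokens instead.

```python
import re
from typing import Callable


def approx_token_count(text: str) -> int:
    """Rough stand-in for a tokenizer: counts words and punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def fits(text: str, limit: int, length_function: Callable[[str], int] = len) -> bool:
    """Check whether the text fits within the limit under the given length measure."""
    return length_function(text) <= limit


sentence = "Splitters measure chunks with a length function."
print(len(sentence))                           # length in characters
print(approx_token_count(sentence))            # length in approximate tokens
print(fits(sentence, 10))                      # measured in characters
print(fits(sentence, 10, approx_token_count))  # measured in approximate tokens
```

The same sentence exceeds a limit of 10 when measured in characters but fits when measured in approximate tokens, which is why the choice of `length_function` changes how much text ends up in each chunk.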

In [25]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader("source/The One Page Linux Manual.pdf")
pages = loader.load_and_split()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,
    chunk_overlap=10,
    length_function=len,
)

docs = text_splitter.split_documents(pages)
for doc in docs:
    print(doc)

page_content='THE ONE     PAGE LINUX MANUALA summary of useful' metadata={'source': 'source/The One Page Linux Manual.pdf', 'page': 0}
page_content='of useful Linux commands' metadata={'source': 'source/The One Page Linux Manual.pdf', 'page': 0}
page_content='Version 3.0 May 1999 squadron@powerup.com.au' metadata={'source': 'source/The One Page Linux Manual.pdf', 'page': 0}
page_content='Starting & Stopping' metadata={'source': 'source/The One Page Linux Manual.pdf', 'page': 0}
page_content='shutdown -h now Shutdown the system now and do' metadata={'source': 'source/The One Page Linux Manual.pdf', 'page': 0}
page_content='and do not' metadata={'source': 'source/The One Page Linux Manual.pdf', 'page': 0}
page_content='reboot\nhalt Stop all processes - same as above' metadata={'source': 'source/The One Page Linux Manual.pdf', 'page': 0}
page_content='shutdown -r 5 Shutdown the system in 5 minutes' metadata={'source': 'source/The One Page Linux Manual.pdf', 'page': 0}
page_content='5 minu

We created an instance of the `RecursiveCharacterTextSplitter` class with the desired parameters. The default list of characters to split by is `["\n\n", "\n", " ", ""]`.

The text is first split by two new-line characters `(\n\n)`. Then, since the chunks are still larger than the desired chunk size (50), the class tries to split the output by a single new-line character `(\n)`.
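
The fallback through the separator list can be sketched in plain Python. This is a simplified illustration rather than LangChain's implementation. The function name is hypothetical, chunk overlap is omitted, and hard character cuts stand in for the final `""` separator.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split on the first separator, merge small pieces up to chunk_size,
    and recurse with the remaining separators on pieces that are too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        if len(piece) > chunk_size:
            # Flush what has been merged so far, then split the long piece further.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, rest, chunk_size))
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Starting & Stopping\n\n"
    "shutdown -h now Shutdown the system now\n"
    "reboot Stop all processes"
)
chunks = recursive_split(sample, ["\n\n", "\n", " "], chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

Only the line that is still too long after the `\n` split is broken up on spaces. The shorter pieces are kept whole.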

### NLTK Text Splitter

In [26]:
# !pip install -q nltk

In [28]:
import nltk

In [30]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [31]:
from langchain.text_splitter import NLTKTextSplitter

# Load a long document
with open('/home/cloudsuperadmin/scrape-chain/langchain/LLM.txt', encoding= 'unicode_escape') as f:
    sample_text = f.read()

text_splitter = NLTKTextSplitter(chunk_size=500)
texts = text_splitter.split_text(sample_text)
print(texts)

FileNotFoundError: [Errno 2] No such file or directory: '/home/cloudsuperadmin/scrape-chain/langchain/LLM.txt'

Note, however, that the NLTKTextSplitter is not designed to segment English text that has no spaces between words. For that purpose, you can use alternative libraries such as pyenchant or wordsegment.

### SpacyTextSplitter

The SpacyTextSplitter helps split large text documents into smaller chunks based on a specified size. This is useful for better management of large text inputs. It's important to note that the SpacyTextSplitter is an alternative to NLTK-based sentence splitting. You can create a SpacyTextSplitter object by specifying the chunk_size parameter, measured by a length function passed to it, which defaults to the number of characters.

In [None]:
from langchain.text_splitter import SpacyTextSplitter

# Load a long document
with open('/home/cloudsuperadmin/scrape-chain/langchain/LLM.txt', encoding= 'unicode_escape') as f:
    sample_text = f.read()

# Instantiate the SpacyTextSplitter with the desired chunk size
text_splitter = SpacyTextSplitter(chunk_size=500, chunk_overlap=20)

# Split the text using SpacyTextSplitter
texts = text_splitter.split_text(sample_text)

# Print the first chunk
print(texts[0])

### MarkdownTextSplitter

The MarkdownTextSplitter is designed to split Markdown text along structural elements such as headers, code blocks, or dividers. It is implemented as a simple subclass of RecursiveCharacterTextSplitter with Markdown-specific separators. By default, these separators follow Markdown syntax, but they can be customized by providing a list of separators when initializing the MarkdownTextSplitter instance. The chunk size is measured by the length function passed in, which defaults to the number of characters. To customize the chunk size, provide an integer value when initializing an instance.

In [32]:
from langchain.text_splitter import MarkdownTextSplitter

markdown_text = """
# 

# Welcome to My Blog!

## Introduction
Hello everyone! My name is **John Doe** and I am a _software developer_. I specialize in Python, Java, and JavaScript.

Here's a list of my favorite programming languages:

1. Python
2. JavaScript
3. Java

You can check out some of my projects on [GitHub](https://github.com).

## About this Blog
In this blog, I will share my journey as a software developer. I'll post tutorials, my thoughts on the latest technology trends, and occasional book reviews.

Here's a small piece of Python code to say hello:

\``` python
def say_hello(name):
    print(f"Hello, {name}!")

say_hello("John")
\```

Stay tuned for more updates!

## Contact Me
Feel free to reach out to me on [Twitter](https://twitter.com) or send me an email at johndoe@email.com.

"""

markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)
docs = markdown_splitter.create_documents([markdown_text])

print(docs)

[Document(page_content='# \n\n# Welcome to My Blog!', metadata={}), Document(page_content='## Introduction', metadata={}), Document(page_content='Hello everyone! My name is **John Doe** and I am a _software developer_. I specialize in Python,', metadata={}), Document(page_content='Java, and JavaScript.', metadata={}), Document(page_content="Here's a list of my favorite programming languages:\n\n1. Python\n2. JavaScript\n3. Java", metadata={}), Document(page_content='You can check out some of my projects on [GitHub](https://github.com).', metadata={}), Document(page_content='## About this Blog', metadata={}), Document(page_content="In this blog, I will share my journey as a software developer. I'll post tutorials, my thoughts on", metadata={}), Document(page_content='the latest technology trends, and occasional book reviews.', metadata={}), Document(page_content="Here's a small piece of Python code to say hello:", metadata={}), Document(page_content='\\``` python\ndef say_hello(name):\n

### TokenTextSplitter

In [33]:
!pip install -q tiktoken

In [None]:
from langchain.text_splitter import TokenTextSplitter

# Load a long document
with open('/home/cloudsuperadmin/scrape-chain/langchain/LLM.txt', encoding= 'unicode_escape') as f:
    sample_text = f.read()

# Initialize the TokenTextSplitter with desired chunk size and overlap
text_splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=50)

# Split into smaller chunks
texts = text_splitter.split_text(sample_text)
print(texts[0])

- CharacterTextSplitter balances manageable chunk sizes against preserving semantic context. Experimenting with different chunk sizes and overlaps lets you tailor the results to specific use cases.
- RecursiveCharacterTextSplitter focuses on preserving semantic relationships while offering customizable chunk sizes and overlaps.
- NLTKTextSplitter utilizes the Natural Language Toolkit library for more accurate text segmentation.
- SpacyTextSplitter leverages the popular spaCy library to split texts based on linguistic features.
- MarkdownTextSplitter is tailored for Markdown-formatted texts, ensuring content is split meaningfully according to the syntax.
- Lastly, TokenTextSplitter employs BPE tokens for splitting, offering a fine-grained approach to text segmentation.

# Exploring the World of Embeddings

## Similarity search and vector embeddings 

In [37]:
#!pip install scikit-learn

In [38]:
import openai
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from langchain.embeddings import OpenAIEmbeddings

# Define the documents
documents = [
    "The cat is on the mat.",
    "There is a cat on the mat.",
    "The dog is in the yard.",
    "There is a dog in the yard.",
]

# Initialize the OpenAIEmbeddings instance
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# Generate embeddings for the documents
document_embeddings = embeddings.embed_documents(documents)

# Perform a similarity search for a given query
query = "A cat is sitting on a mat."
query_embedding = embeddings.embed_query(query)

# Calculate similarity scores
similarity_scores = cosine_similarity([query_embedding], document_embeddings)[0]

# Find the most similar document
most_similar_index = np.argmax(similarity_scores)
most_similar_document = documents[most_similar_index]

print(f"Most similar document to the query '{query}':")
print(most_similar_document)

Most similar document to the query 'A cat is sitting on a mat.':
The cat is on the mat.
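
To make the score concrete, cosine similarity is the dot product of two vectors divided by the product of their norms. The sketch below computes it by hand on toy three-dimensional vectors. The vectors and labels are invented for illustration and are not real embeddings.

```python
import math
from typing import Dict, List


def cosine_sim(a: List[float], b: List[float]) -> float:
    """Cosine similarity: dot(a, b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, made up for illustration only.
query_vec = [1.0, 0.0, 1.0]
doc_vecs: Dict[str, List[float]] = {
    "cat on mat": [1.0, 0.0, 0.9],
    "dog in yard": [0.0, 1.0, 0.1],
}

# Score every document against the query and pick the highest.
scores = {name: cosine_sim(query_vec, vec) for name, vec in doc_vecs.items()}
best = max(scores, key=scores.get)
print(best)
```

Vectors pointing in nearly the same direction score close to 1, and orthogonal vectors score 0.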


## Embedding Models

In [41]:
#!pip install sentence_transformers==2.2.2

In [42]:
from langchain.embeddings import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device': 'cpu'}
hf = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

documents = ["Document 1", "Document 2", "Document 3"]
doc_embeddings = hf.embed_documents(documents)


## Cohere embeddings

In [44]:
#!pip install cohere

In [48]:
from langchain.embeddings import CohereEmbeddings

# Initialize the CohereEmbeddings object
cohere = CohereEmbeddings(
	model="embed-multilingual-v2.0",
	cohere_api_key=os.environ['COHERE_API_KEY']
)

# Define a list of texts
texts = [
    "Hello from Cohere!", 
    "مرحبًا من كوهير!", 
    "Hallo von Cohere!",  
    "Bonjour de Cohere!", 
    "¡Hola desde Cohere!", 
    "Olá do Cohere!",  
    "Ciao da Cohere!", 
    "您好，来自 Cohere！", 
    "कोहेरे से नमस्ते!"
]

# Generate embeddings for the texts
document_embeddings = cohere.embed_documents(texts)

# Print the embeddings
for text, embedding in zip(texts, document_embeddings):
    print(f"Text: {text}")
    print(f"Embedding: {embedding[:5]}")  # print first 5 dimensions of each embedding

Text: Hello from Cohere!
Embedding: [0.23449707, 0.50097656, -0.04876709, 0.14001465, -0.1796875]
Text: مرحبًا من كوهير!
Embedding: [0.25341797, 0.30004883, 0.01083374, 0.12573242, -0.1821289]
Text: Hallo von Cohere!
Embedding: [0.10205078, 0.28320312, -0.0496521, 0.2364502, -0.0715332]
Text: Bonjour de Cohere!
Embedding: [0.15161133, 0.28222656, -0.057281494, 0.11743164, -0.044189453]
Text: ¡Hola desde Cohere!
Embedding: [0.25146484, 0.43139648, -0.08642578, 0.24682617, -0.117004395]
Text: Olá do Cohere!
Embedding: [0.18676758, 0.390625, -0.04550171, 0.14562988, -0.11230469]
Text: Ciao da Cohere!
Embedding: [0.11590576, 0.4333496, -0.025772095, 0.14538574, 0.0703125]
Text: 您好，来自 Cohere！
Embedding: [0.24645996, 0.3083496, -0.111816406, 0.26586914, -0.05102539]
Text: कोहेरे से नमस्ते!
Embedding: [0.19274902, 0.6352539, 0.031951904, 0.117370605, -0.26098633]


## Deep Lake Vector Store

In [49]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

In [50]:
# create our documents
texts = [
    "Napoleon Bonaparte was born in 15 August 1769",
    "Louis XIV was born in 5 September 1638",
    "Lady Gaga was born in 28 March 1986",
    "Michael Jeffrey Jordan was born in 17 February 1963"
]
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.create_documents(texts)

In [51]:
# initialize embeddings model
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# create Deep Lake dataset
# TODO: use your organization id here. (by default, org id is your username)
my_activeloop_org_id = os.environ["ACTIVELOOP-ORG-ID"]
my_activeloop_dataset_name = "langchain_course_embeddings"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)

# add documents to our Deep Lake dataset
db.add_documents(docs)

Your Deep Lake dataset has been successfully created!



Dataset(path='hub://bettermaxfeng/langchain_course_embeddings', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype      shape     dtype  compression
  -------    -------    -------   -------  ------- 
 embedding  embedding  (4, 1536)  float32   None   
    id        text      (4, 1)      str     None   
 metadata     json      (4, 1)      str     None   
   text       text      (4, 1)      str     None   


 

['0004ace2-4083-11ee-b8f9-d89c6787905c',
 '0004ace3-4083-11ee-9aea-d89c6787905c',
 '0004ace4-4083-11ee-8297-d89c6787905c',
 '0004ace5-4083-11ee-9092-d89c6787905c']

In [52]:
# create retriever from db
retriever = db.as_retriever()

In [53]:
# instantiate the llm wrapper
model = ChatOpenAI(model='gpt-3.5-turbo')

# create the question-answering chain
qa_chain = RetrievalQA.from_llm(model, retriever=retriever)

# ask a question to the chain
qa_chain.run("When was Michael Jordan born?")

'Michael Jordan was born on 17 February 1963.'

1. OpenAI and LangChain Integration: LangChain, a library built for chaining NLP models, is designed to work seamlessly with OpenAI's GPT-3.5-turbo model for language understanding and generation. You've initialized OpenAI embeddings using OpenAIEmbeddings(), and these embeddings are later used to transform the text into a high-dimensional vector representation. This vector representation captures the semantic essence of the text and is essential for information retrieval tasks.
2. Deep Lake: Deep Lake is a Vector Store for creating, storing, and querying vector representations (also known as embeddings) of data.
3. Text Retrieval: Using the db.as_retriever() function, you've transformed the Deep Lake dataset into a retriever object. This object is designed to fetch the most relevant pieces of text from the dataset based on the semantic similarity of their embeddings.
4. Question Answering: The final step sets up a RetrievalQA chain from LangChain. The chain accepts a natural language question, passes it to the retriever, which embeds it and fetches the most relevant document chunks from the Deep Lake dataset, and then generates a natural language answer. The question is embedded by the OpenAIEmbeddings function attached to the Deep Lake store; the ChatOpenAI model is responsible only for generating the answer from the retrieved chunks.
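
The retrieval step in point 3 can be illustrated without any API calls. The sketch below is a simplified stand-in, not Deep Lake's implementation: the three-dimensional vectors are invented toy values (the real embeddings in the tensor summary above have 1536 dimensions), and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k texts whose embeddings are most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Toy "vector store": (text, embedding) pairs with invented 3-d embeddings.
toy_store = [
    ("Michael Jordan was born on 17 February 1963.", [0.9, 0.1, 0.0]),
    ("Deep Lake stores embeddings.", [0.0, 0.2, 0.9]),
    ("Basketball is played with a ball.", [0.7, 0.6, 0.1]),
]

# Invented embedding standing in for the question "When was Michael Jordan born?"
query_embedding = [1.0, 0.2, 0.0]

results = top_k(query_embedding, toy_store, k=2)
print(results)
```

In the real chain, the ranked chunks are what gets handed to the language model as context.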

# Build a Customer Support Question Answering Chatbot

## Having a Knowledge Base

## Workflow

![Chatbot workflow](pics/chatbot_workflow.webp)

In [3]:
# !pip install unstructured selenium

In [6]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.text_splitter import CharacterTextSplitter
from langchain import OpenAI
from langchain.document_loaders import SeleniumURLLoader
from langchain import PromptTemplate



In [8]:
# we'll use information from the following articles
urls = ['https://beebom.com/what-is-nft-explained/',
        'https://beebom.com/how-delete-spotify-account/',
        'https://beebom.com/how-download-gif-twitter/',
        'https://beebom.com/how-use-chatgpt-linux-terminal/',
        'https://beebom.com/how-delete-spotify-account/',
        'https://beebom.com/how-save-instagram-story-with-music/',
        'https://beebom.com/how-install-pip-windows/',
        'https://beebom.com/how-check-disk-usage-linux/']

### 1: Split the documents into chunks and compute their embeddings

In [9]:
# use the selenium scraper to load the documents
loader = SeleniumURLLoader(urls=urls)
docs_not_splitted = loader.load()

# we split the documents into smaller chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(docs_not_splitted)

Created a chunk of size 1226, which is longer than the specified 1000
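
The warning above appears because `CharacterTextSplitter` splits only on its separator (a blank line, `"\n\n"`, by default) and then merges the pieces back together up to `chunk_size`; a single piece that is already longer than the limit is kept whole. The following is a simplified pure-Python approximation of that behaviour, not the library's code. `split_on_separator` is a hypothetical name and the sample text is invented.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Split on separator, then greedily merge pieces up to chunk_size characters."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate          # piece still fits in the current chunk
        else:
            if current:
                chunks.append(current)   # close the current chunk
            current = piece              # an oversized piece is kept whole
    if current:
        chunks.append(current)
    return chunks

# Invented sample: the middle paragraph has no separator inside it.
sample = "short paragraph\n\n" + "x" * 30 + "\n\nanother short one"
chunks = split_on_separator(sample, chunk_size=20)
for chunk in chunks:
    note = " (longer than chunk_size)" if len(chunk) > 20 else ""
    print(f"{len(chunk)} characters{note}")
```

If oversized chunks are a problem, a smaller separator or a recursive splitter is the usual remedy, since the limit can only be honoured where a separator exists.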


In [10]:
# Before executing the following code, make sure to have
# your OpenAI key saved in the “OPENAI_API_KEY” environment variable.
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# create Deep Lake dataset
# TODO: use your organization id here. (by default, org id is your username)
my_activeloop_org_id = os.environ["ACTIVELOOP-ORG-ID"]
my_activeloop_dataset_name = "langchain_course_customer_support"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)

# add documents to our Deep Lake dataset
db.add_documents(docs)

Your Deep Lake dataset has been successfully created!



Dataset(path='hub://bettermaxfeng/langchain_course_customer_support', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype       shape      dtype  compression
  -------    -------     -------    -------  ------- 
 embedding  embedding  (127, 1536)  float32   None   
    id        text      (127, 1)      str     None   
 metadata     json      (127, 1)      str     None   
   text       text      (127, 1)      str     None   




['cff08f7c-420e-11ee-b60b-d89c6787905c',
 'cff08f7d-420e-11ee-98a7-d89c6787905c',
 'cff08f7e-420e-11ee-b082-d89c6787905c',
 'cff08f7f-420e-11ee-b8f9-d89c6787905c',
 'cff08f80-420e-11ee-9aea-d89c6787905c',
 'cff0b66d-420e-11ee-8297-d89c6787905c',
 'cff0b66e-420e-11ee-9092-d89c6787905c',
 'cff0b66f-420e-11ee-bdca-d89c6787905c',
 'cff0b670-420e-11ee-a0b8-d89c6787905c',
 'cff0b671-420e-11ee-9f19-d89c6787905c',
 'cff0b672-420e-11ee-99ea-d89c6787905c',
 'cff0b673-420e-11ee-bac4-d89c6787905c',
 'cff0b674-420e-11ee-b229-d89c6787905c',
 'cff0b675-420e-11ee-b51c-d89c6787905c',
 'cff0b676-420e-11ee-9369-d89c6787905c',
 'cff0b677-420e-11ee-bdf0-d89c6787905c',
 'cff0b678-420e-11ee-9e80-d89c6787905c',
 'cff0b679-420e-11ee-96ea-d89c6787905c',
 'cff0b67a-420e-11ee-a556-d89c6787905c',
 'cff0b67b-420e-11ee-b910-d89c6787905c',
 'cff0b67c-420e-11ee-ba1e-d89c6787905c',
 'cff0b67d-420e-11ee-8dfa-d89c6787905c',
 'cff0b67e-420e-11ee-a04c-d89c6787905c',
 'cff0b67f-420e-11ee-88e9-d89c6787905c',
 'cff0b680-420e-

In [11]:
# let's see the top relevant documents to a specific query
query = "how to check disk usage in linux?"
docs = db.similarity_search(query)
print(docs[0].page_content)

Home  Tech  How to Check Disk Usage in Linux (4 Methods)

How to Check Disk Usage in Linux (4 Methods)

Beebom Staff

Last Updated: June 19, 2023 5:14 pm

There may be times when you need to download some important files or transfer some photos to your Linux system, but face a problem of insufficient disk space. You head over to your file manager to delete the large files which you no longer require, but you have no clue which of them are occupying most of your disk space. In this article, we will show some easy methods to check disk usage in Linux from both the terminal and the GUI application.

Monitor Disk Usage in Linux (2023)

Table of Contents

Check Disk Space Using the df Command
		
Display Disk Usage in Human Readable FormatDisplay Disk Occupancy of a Particular Type

Check Disk Usage using the du Command
		
Display Disk Usage in Human Readable FormatDisplay Disk Usage for a Particular DirectoryCompare Disk Usage of Two Directories


### 2: Craft a prompt for GPT-3 using the suggested strategies

In [12]:
# let's write a prompt for a customer support chatbot that
# answers questions using information extracted from our db
template = """You are an exceptional customer support chatbot that gently answers questions.

You know the following context information.

{chunks_formatted}

Answer the following question from a customer. Use only information from the previous context information. Do not invent stuff.

Question: {query}

Answer:"""

prompt = PromptTemplate(
    input_variables=["chunks_formatted", "query"],
    template=template,
)
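
To see what the model eventually receives, a template of this shape can be filled with plain `str.format`, used here as a rough stand-in for `PromptTemplate.format` with the same two variables. The template below is abbreviated, and the chunk strings are short placeholders borrowed from the table of contents printed earlier, not real retrieval results.

```python
# Abbreviated version of the customer support template.
template = """You know the following context information.

{chunks_formatted}

Question: {query}

Answer:"""

# Placeholder chunks standing in for the results of a similarity search.
retrieved_chunks = [
    "Check Disk Space Using the df Command",
    "Check Disk Usage using the du Command",
]

# Join the chunks with a blank line between them, then fill both placeholders.
chunks_formatted = "\n\n".join(retrieved_chunks)
prompt_formatted = template.format(
    chunks_formatted=chunks_formatted,
    query="How to check disk usage in linux?",
)
print(prompt_formatted)
```

Ending the prompt with `Answer:` nudges a completion model to continue with the answer itself.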

### 3: Utilize the GPT-3 model with a temperature of 0 for text generation

In [13]:
# the full pipeline

# user question
query = "How to check disk usage in linux?"

# retrieve relevant chunks
docs = db.similarity_search(query)
retrieved_chunks = [doc.page_content for doc in docs]

# format the prompt
chunks_formatted = "\n\n".join(retrieved_chunks)
prompt_formatted = prompt.format(chunks_formatted=chunks_formatted, query=query)

# generate answer
llm = OpenAI(model="text-davinci-003", temperature=0)
answer = llm(prompt_formatted)
print(answer)

 You can check disk usage in Linux using the df command or by using a GUI tool such as the GDU Disk Usage Analyzer or the Gnome Disks Tool. The df command is used to check the current disk usage and the available disk space in Linux. The syntax for the df command is: df <options> <file_system>. The options to use with the df command are: a, h, t, and x. To install the GDU Disk Usage Analyzer, use the command: sudo snap install gdu-disk-usage-analyzer. To install the Gnome Disks Tool, use the command: sudo apt-get -y install gnome-disk-utility.


## Issues with Generating Answers using GPT-3

Suppose we ask, "Is the Linux distribution free?" and provide GPT-3 with a document about kernel features as context. It might generate an answer like "Yes, the Linux distribution is free to download and use," even if such information is not present in the context document. Producing false information is highly undesirable for customer service chatbots!

GPT-3 is less likely to generate false information when the answer to the user's question is contained within the context. Since user questions are often brief and ambiguous, we cannot always rely on the semantic search step to retrieve the correct document. Thus, there is always a risk of generating false information.

# Conversation Intelligence: Gong.io Open-Source Alternative AI Sales Assistant

## Didn't Work: Naively Splitting the Custom Knowledge Base
> Objection: "There's no money."
It could be that your prospect's business simply isn't big enough or generating enough cash right now to afford a product like yours. Track their growth and see how you can help your prospect get to a place where your offering would fit into their business.

> Objection: "We don't have any budget left this year."
A variation of the "no money" objection, what your prospect's telling you here is that they're having cash flow issues. But if there's a pressing problem, it needs to get solved eventually. Either help your prospect secure a budget from executives to buy now or arrange a follow-up call for when they expect funding to return.

> Objection: "We need to use that budget somewhere else."
Prospects sometimes try to earmark resources for other uses. It's your job to make your product/service a priority that deserves budget allocation now. Share case studies of similar companies that have saved money, increased efficiency, or had a massive ROI with you.

If we naively split this text, we might end up with individual sections that look like this:

> Objection: "We need to use that budget somewhere else."

Here, the objection is cut off from its own advice, so any advice that lands in the same chunk belongs to a different objection. When we try to retrieve the most relevant chunk for the objection "We need to use that budget somewhere else", this will likely be our top result, which isn't what we want. Passing such a chunk to the LLM is likely to confuse it.
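
A quick way to see the problem is to cut the text every fixed number of characters. The sketch below uses a shortened copy of the objections above; `naive_split` and the 200-character size are illustrative assumptions, not the splitter used in the original project.

```python
# Shortened copy of the knowledge base above (first sentence of each piece of advice).
knowledge_base = """Objection: "There's no money."
It could be that your prospect's business simply isn't big enough or generating enough cash right now to afford a product like yours.

Objection: "We don't have any budget left this year."
A variation of the "no money" objection, what your prospect's telling you here is that they're having cash flow issues.

Objection: "We need to use that budget somewhere else."
Prospects sometimes try to earmark resources for other uses."""

def naive_split(text, chunk_size):
    """Cut the text every chunk_size characters, ignoring its structure."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = naive_split(knowledge_base, chunk_size=200)
for i, chunk in enumerate(chunks):
    aligned = chunk.startswith("Objection:")
    print(f"chunk {i}: starts at an objection = {aligned}")
```

Any chunk that does not start at an objection mixes the tail of one entry with the head of the next, which is the mismatch described above.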

## Did Work: Intelligent Splitting
In our example text, there is a set structure to each individual objection and its recommended response. Rather than split the text based on size, why don't we split the text based on its structure? We want each chunk to begin with the objection, and end before the "Objection" of the next chunk. Here's how we could do it:

text = """
Objection: "There's no money."
It could be that your prospect's business simply isn't big enough or generating enough cash right now to afford a product like yours. Track their growth and see how you can help your prospect get to a place where your offering would fit into their business.

Objection: "We don't have any budget left this year."
A variation of the "no money" objection, what your prospect's telling you here is that they're having cash flow issues. But if there's a pressing problem, it needs to get solved eventually. Either help your prospect secure a budget from executives to buy now or arrange a follow-up call for when they expect funding to return.

Objection: "We need to use that budget somewhere else."
Prospects sometimes try to earmark resources for other uses. It's your job to make your product/service a priority that deserves budget allocation now. Share case studies of similar companies that have saved money, increased efficiency, or had a massive ROI with you.
"""

# Split the text into a list using the keyword "Objection: "
objections_list = text.split("Objection: ")[1:]  # We ignore the first split, which holds only the leading newline

# Now, prepend "Objection: " to each item as splitting removed it
objections_list = ["Objection: " + objection for objection in objections_list]

This gave us the best results. Nailing the way we split and embed our knowledge base means more relevant documents are retrieved and the LLM gets the best possible context to generate a response from. Now let's see how we integrated this solution with Deep Lake and SalesCopilot!
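
As a variation on the structure-based split in this section, a regular expression with a lookahead keeps the keyword in each chunk, so nothing has to be prepended afterwards. This is only a sketch: it assumes Python 3.7 or later (where `re.split` accepts patterns that match an empty string), and the advice text is abbreviated from the example above.

```python
import re

# Abbreviated excerpt of the objections text.
text = """
Objection: "There's no money."
Track their growth and see how you can help your prospect.

Objection: "We don't have any budget left this year."
Arrange a follow-up call for when they expect funding to return.
"""

# Split at every position that is immediately followed by "Objection: ".
# The lookahead (?=...) matches without consuming text, so the keyword stays in each chunk.
parts = re.split(r"(?=Objection: )", text)

# Drop whitespace-only pieces (the text before the first objection) and trim the rest.
objections_list = [part.strip() for part in parts if part.strip()]

for objection in objections_list:
    print(objection)
    print("---")
```

Each resulting item pairs one objection with its own advice, which is the property the chunks need before they are embedded.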

# FableForge: Creating Picture Books with OpenAI, Replicate, and Deep Lake